In the function “gru_forward” there are two sigmoids and one tanh (sigmoid, sigmoid, tanh, in that order). I was experimenting with these functions and found that if I replace both sigmoids with tanh (all three tanh), the network doesn’t learn (the loss becomes nan). The same happens if I replace the sigmoids with relu (relu, relu, tanh). If I put sigmoids in all three places, again the network doesn’t learn. But if I replace the tanh with relu (sigmoid, sigmoid, relu), learning is even faster than the default.
My question is: how can I know in advance which of these will work? Is there any intuition for this?
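For reference, here is a minimal scalar sketch of a standard GRU cell showing where the three activations sit. This is my own illustration, not your `gru_forward` (the weight names `wr`, `ur`, etc. are hypothetical, and a real GRU uses weight matrices rather than scalars):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_cell(x, h_prev, w):
    # w: dict of scalar weights (a real GRU uses matrices and biases)
    r = sigmoid(w["wr"] * x + w["ur"] * h_prev)          # reset gate  (sigmoid #1)
    z = sigmoid(w["wz"] * x + w["uz"] * h_prev)          # update gate (sigmoid #2)
    n = math.tanh(w["wn"] * x + w["un"] * (r * h_prev))  # candidate state (tanh)
    return (1.0 - z) * n + z * h_prev                    # convex blend via z

w = {k: 0.5 for k in ("wr", "ur", "wz", "uz", "wn", "un")}
h = gru_cell(1.0, 0.0, w)
```

Because `z` comes from a sigmoid, the new state is a convex combination of the previous state and the tanh-bounded candidate, so `h` stays in (-1, 1) here.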
This is more of a side comment than a direct answer:
Note that pytorch’s sigmoid() is the logistic function, which is a rescaled and shifted version of tanh(). Given that the weights in Linear layers do scaling and their biases do shifts, you would expect the two versions of your network to train to points where sigmoid() and tanh() act essentially equivalently.
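The rescale/shift relationship is exact: sigmoid(x) = (tanh(x/2) + 1) / 2, so a preceding Linear layer can absorb the factor of 2 into its weights and the shift into downstream biases. A quick numeric check:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# sigmoid(x) == 0.5 * (tanh(x / 2) + 1) holds for all real x
for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert abs(sigmoid(x) - 0.5 * (math.tanh(x / 2) + 1.0)) < 1e-12
```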
I would speculate that your network (together with its training data)
is close to being unstable, so that making the seemingly irrelevant
change from sigmoid() to tanh() is enough to kick it into an
unstable regime.
What happens if you use a plain-vanilla SGD optimizer and/or lower
your learning rate?
relu() is different in character from sigmoid() / tanh(), but, even
so, I wouldn’t expect a mixture of relu() and tanh() to break your
training unless your network were already close to being unstable.
Sigmoid gates enforce convex/conic combinations (in an RNN, of the values from two consecutive timesteps). With tanh() it is basically not a gate anymore, but a source of oscillations.
As for relu, it similarly gives a non-gated RNN design.
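The convex-combination point can be seen in a toy recurrence. Below is a hypothetical update (the candidate value and the pre-activation `2*h - 3` are made up for illustration): with a sigmoid gate the state always stays between the previous state and the candidate, while a tanh "gate" can go negative, so the blend weights `g` and `1 - g` are no longer convex and the state can escape that interval:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def step(h, gate):
    cand = 1.0                # fixed candidate value (illustrative)
    g = gate(2.0 * h - 3.0)   # gate pre-activation depends on the state
    return g * h + (1.0 - g) * cand

h_sig = h_tanh = -2.0
sig_traj, tanh_traj = [], []
for _ in range(20):
    h_sig = step(h_sig, sigmoid)
    h_tanh = step(h_tanh, math.tanh)
    sig_traj.append(h_sig)
    tanh_traj.append(h_tanh)

# sigmoid gating: every iterate is a convex blend of the current state
# and the candidate 1.0, so the trajectory never leaves [-2, 1].
# tanh gating: g can be near -1, making the blend weights (g, 1 - g)
# roughly (-1, 2), and the very first step already jumps to about 4.
```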