Sigmoid vs tanh vs relu

I was running some experiments with a GRU. This is the default GRU code:

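import torch
import torch.nn as nn

# Both functions are presumably methods of an nn.Module subclass.
def initialize_gru(self, input_size, hidden_size):
    # Input-to-hidden projections for the reset gate, update gate, and candidate state.
    self.input_reset = nn.Linear(input_size, hidden_size)
    self.input_update = nn.Linear(input_size, hidden_size)
    self.input_new_gate = nn.Linear(input_size, hidden_size)

    # Hidden-to-hidden projections for the same three paths.
    self.hid_reset = nn.Linear(hidden_size, hidden_size)
    self.hid_update = nn.Linear(hidden_size, hidden_size)
    self.hid_new_gate = nn.Linear(hidden_size, hidden_size)

def gru_forward(self, inp, last_hid):
    # Reset and update gates, each squashed into (0, 1) by the sigmoid.
    r_t = torch.sigmoid(self.input_reset(inp) + self.hid_reset(last_hid))
    z_t = torch.sigmoid(self.input_update(inp) + self.hid_update(last_hid))
    # Candidate state in (-1, 1); the reset gate scales the hidden contribution.
    n_t = torch.tanh(self.input_new_gate(inp) + r_t * self.hid_new_gate(last_hid))
    # Blend the candidate and the previous hidden state using the update gate.
    h_t = (1 - z_t) * n_t + z_t * last_hid
    return h_t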

In the function “gru_forward” there are two sigmoids and one tanh (sigmoid, sigmoid, tanh, in that order). I experimented with these functions and found that if I replace both sigmoids with tanh (so all three are tanh), the network doesn’t learn (the loss becomes NaN). The same happens if I replace the sigmoids with relu (relu, relu, tanh). If I put sigmoids in all three places, the network again doesn’t learn. But if I replace the tanh with relu (sigmoid, sigmoid, relu), learning is even faster than with the default.
My question is: how can I know which of these combinations will work? Is there any intuition for it?


Deep Learning black magic? :smiley:
I don’t know the answer but does the paper that introduces the GRU model offer more details?

I read that paper, but I can’t see any explanation for the activation functions. :pensive:

Hi Anuj!

This is more of a side comment than a direct answer:

Note that PyTorch’s sigmoid() is the logistic function, and that is
a rescaled and shifted version of tanh(). Given that the weights
in Linear layers do scaling and their biases do shifts, you would
expect the two versions of your network to train to points where
sigmoid() and tanh() act essentially equivalently.
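
To make the rescaling concrete: the logistic function satisfies sigmoid(x) = (tanh(x/2) + 1) / 2. Here is a minimal scalar sketch that checks this numerically. It uses only Python's math module rather than the torch functions themselves, and the helper names are made up for illustration:

```python
import math

def sigmoid(x):
    """Logistic function, the scalar analogue of torch.sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_via_tanh(x):
    """The same function written as a shifted and rescaled tanh."""
    return 0.5 * (math.tanh(0.5 * x) + 1.0)

# Compare the two forms on a handful of sample points.
samples = [-6.0, -1.5, 0.0, 0.3, 2.0, 6.0]
max_gap = max(abs(sigmoid(x) - sigmoid_via_tanh(x)) for x in samples)
print(max_gap)  # difference is at floating-point rounding level
```

Note that the identity rescales both the input (x/2) and the output (the 0.5 factor and the +1 shift).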

I would speculate that your network (together with its training data)
is close to being unstable, so that making the seemingly irrelevant
change from sigmoid() to tanh() is enough to kick it into an
unstable regime.

What happens if you use a plain-vanilla SGD optimizer and/or lower
your learning rate?

relu() is different in character from sigmoid() / tanh(), but, even
so, I wouldn’t expect a mixture of relu() and tanh() to break your
training unless your network were already close to being unstable.


K. Frank

Sigmoid gates enforce convex/conic combinations (for RNNs, combinations of values from two consecutive timesteps). With tanh() it is basically not a gate anymore, but a source of oscillations.

As for relu, it similarly gives a non-gated RNN design.

Hello Alex!

Yes, I agree. Your comment explains what is going on and the choice
of sigmoid().

In particular, with sigmoid():

z_t = torch.sigmoid(self.input_update(inp)+ self.hid_update(last_hid))

z_t is forced to be between 0 and 1, so that:

h_t = (1 - z_t) *n_t + z_t*last_hid

is, as you say, a convex combination of n_t and last_hid.

With tanh(), z_t can become negative, so that (assuming that h_t
becomes the new last_hid) last_hid can oscillate in sign.

With relu(), z_t can become larger than 1 so that last_hid can
grow exponentially (and also, (1 - z_t) can become negative).
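
Here is a scalar sketch of these three cases. It assumes the gate value z and the candidate n are held fixed (in the real GRU both change at every step) and that h_t is fed back in as last_hid. The function name run_update and the chosen numbers are illustrative only:

```python
def run_update(z, n=0.0, h0=1.0, steps=6):
    """Iterate h = (1 - z) * n + z * h with a fixed gate z and candidate n."""
    h = h0
    trace = [h]
    for _ in range(steps):
        h = (1.0 - z) * n + z * h
        trace.append(h)
    return trace

sigmoid_like = run_update(z=0.5)   # z in (0, 1): what a sigmoid gate can produce
tanh_like = run_update(z=-0.5)     # z < 0: reachable with tanh as the gate
relu_like = run_update(z=1.5)      # z > 1: reachable with relu as the gate

for name, trace in [("sigmoid", sigmoid_like), ("tanh", tanh_like), ("relu", relu_like)]:
    print(name, trace)
```

With n = 0 the hidden value after t steps is just z**t times the starting value, so z = 0.5 decays smoothly, z = -0.5 flips sign on every step, and z = 1.5 grows geometrically.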

Thanks for emphasizing the iterative structure of this use case.


K. Frank