In the function “gru_forward” there are two sigmoids and one tanh (sigmoid, sigmoid, tanh, in that order). I was experimenting with these functions and found that if I replace both sigmoids with tanh (all three tanh), the network doesn’t learn (the loss becomes nan). The same happens if I replace the sigmoids with relu (relu, relu, tanh). If I put sigmoids in all three places, again the network doesn’t learn. But if I replace the tanh with relu (sigmoid, sigmoid, relu), learning is even faster than the default.
My question is: how can I know in advance which of these will work? Is there any intuition for this?
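For reference, here is a minimal scalar sketch of a standard GRU cell showing where the three activations sit. This is my own illustration, not your `gru_forward` (the weight names `wr`, `ur`, etc. are hypothetical, and a real GRU uses weight matrices rather than scalars):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_cell(x, h_prev, w):
    # w: dict of scalar weights (a real GRU uses matrices and biases)
    r = sigmoid(w["wr"] * x + w["ur"] * h_prev)          # reset gate  (sigmoid #1)
    z = sigmoid(w["wz"] * x + w["uz"] * h_prev)          # update gate (sigmoid #2)
    n = math.tanh(w["wn"] * x + w["un"] * (r * h_prev))  # candidate state (tanh)
    return (1.0 - z) * n + z * h_prev                    # convex blend via z

w = {k: 0.5 for k in ("wr", "ur", "wz", "uz", "wn", "un")}
h = gru_cell(1.0, 0.0, w)
```

Because `z` comes from a sigmoid, the new state is a convex combination of the previous state and the tanh-bounded candidate, so `h` stays in (-1, 1) here.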
This is more of a side comment than a direct answer:
Note that pytorch’s sigmoid() is the logistic function, which is a rescaled and shifted version of tanh(). Given that the weights in Linear layers do scaling and their biases do shifts, you would expect the two versions of your network to train to points where sigmoid() and tanh() act essentially equivalently.
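The rescale/shift relationship is exact: sigmoid(x) = (tanh(x/2) + 1) / 2, so a preceding Linear layer can absorb the factor of 2 into its weights and the shift into downstream biases. A quick numeric check:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# sigmoid(x) == 0.5 * (tanh(x / 2) + 1) holds for all real x
for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert abs(sigmoid(x) - 0.5 * (math.tanh(x / 2) + 1.0)) < 1e-12
```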
I would speculate that your network (together with its training data)
is close to being unstable, so that making the seemingly irrelevant
change from sigmoid() to tanh() is enough to kick it into an
unstable regime.
What happens if you use a plain-vanilla SGD optimizer and/or lower
your learning rate?
relu() is different in character from sigmoid() / tanh(), but, even
so, I wouldn’t expect a mixture of relu() and tanh() to break your
training unless your network were already close to being unstable.
Sigmoid gates enforce convex/conic combinations (in an RNN, of the values from two consecutive timesteps). With tanh() it is basically not a gate anymore, but a source of oscillations.
As for relu, it similarly gives a non-gated RNN design.
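The convex-combination point can be seen in a toy recurrence. Below is a hypothetical update (the candidate value and the pre-activation `2*h - 3` are made up for illustration): with a sigmoid gate the state always stays between the previous state and the candidate, while a tanh "gate" can go negative, so the blend weights `g` and `1 - g` are no longer convex and the state can escape that interval:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def step(h, gate):
    cand = 1.0                # fixed candidate value (illustrative)
    g = gate(2.0 * h - 3.0)   # gate pre-activation depends on the state
    return g * h + (1.0 - g) * cand

h_sig = h_tanh = -2.0
sig_traj, tanh_traj = [], []
for _ in range(20):
    h_sig = step(h_sig, sigmoid)
    h_tanh = step(h_tanh, math.tanh)
    sig_traj.append(h_sig)
    tanh_traj.append(h_tanh)

# sigmoid gating: every iterate is a convex blend of the current state
# and the candidate 1.0, so the trajectory never leaves [-2, 1].
# tanh gating: g can be near -1, making the blend weights (g, 1 - g)
# roughly (-1, 2), and the very first step already jumps to about 4.
```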