I was doing some experiments with GRUs. Here is the standard GRU cell code:

```python
def initialize_gru(self, input_size, hidden_size):
    self.input_reset = nn.Linear(input_size, hidden_size)
    self.input_update = nn.Linear(input_size, hidden_size)
    self.input_new_gate = nn.Linear(input_size, hidden_size)
    self.hid_reset = nn.Linear(hidden_size, hidden_size)
    self.hid_update = nn.Linear(hidden_size, hidden_size)
    self.hid_new_gate = nn.Linear(hidden_size, hidden_size)

def gru_forward(self, inp, last_hid):
    r_t = torch.sigmoid(self.input_reset(inp) + self.hid_reset(last_hid))
    z_t = torch.sigmoid(self.input_update(inp) + self.hid_update(last_hid))
    n_t = torch.tanh(self.input_new_gate(inp) + r_t * self.hid_new_gate(last_hid))
    h_t = (1 - z_t) * n_t + z_t * last_hid
    return h_t
```

In `gru_forward` there are two sigmoids and one tanh (sigmoid, sigmoid, tanh, in that order). I experimented with these activations and found that if I replace both sigmoids with tanh (all three tanh), the network doesn't learn (the loss becomes NaN). The same happens if I replace the sigmoids with ReLU (ReLU, ReLU, tanh). With sigmoid at all three places, the network again doesn't learn. But if I replace the tanh with ReLU (sigmoid, sigmoid, ReLU), learning is even faster than the default.
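One property worth probing here, outside of any training run: the sigmoid keeps the update gate `z_t` in (0, 1), which makes `h_t = (1 - z_t) * n_t + z_t * last_hid` a convex combination of the candidate and the previous state. Below is a toy scalar sketch (my own illustration, not the code above, with a hand-picked adversarial candidate `n = -h`) showing what can happen to the state's magnitude when the gate range changes from (0, 1) to tanh's (-1, 1):

```python
import math

def step(h_prev, n, z):
    # GRU-style interpolation: h_t = (1 - z) * n + z * h_prev
    return (1 - z) * n + z * h_prev

def run(gate, steps=50):
    # Feed an adversarial candidate n = -h at every step and track |h|.
    # gate() maps a fixed pre-activation (here 2.0) to a gate value z.
    h = 1.0
    for _ in range(steps):
        z = gate(2.0)
        h = step(h, -h, z)
    return abs(h)

sigmoid = lambda x: 1 / (1 + math.exp(-x))  # range (0, 1)

# Sigmoid gate: z in (0, 1), so h_t is a convex combination of n and
# h_prev and |h_t| <= max(|n|, |h_prev|): the state stays bounded.
print(run(sigmoid))                    # stays below 1

# A tanh-style gate that comes out negative (z ~ -0.96): the mixing
# coefficients (1 - z) ~ 1.96 and z ~ -0.96 are no longer in [0, 1],
# and the state magnitude can grow exponentially over time steps.
print(run(lambda x: math.tanh(-x)))    # astronomically large
```

This is only a contrived scalar case, but it illustrates why a gate that can leave [0, 1] removes the boundedness guarantee on the hidden state.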

My question is: how can I know in advance which of these combinations will work? Is there any intuition for this?