I was doing some experiments with GRUs. This is the standard GRU cell code I started from:
```python
import torch
import torch.nn as nn

def initialize_gru(self, input_size, hidden_size):
    # Input-to-hidden projections for the reset, update, and candidate gates
    self.input_reset = nn.Linear(input_size, hidden_size)
    self.input_update = nn.Linear(input_size, hidden_size)
    self.input_new_gate = nn.Linear(input_size, hidden_size)
    # Hidden-to-hidden projections for the same three gates
    self.hid_reset = nn.Linear(hidden_size, hidden_size)
    self.hid_update = nn.Linear(hidden_size, hidden_size)
    self.hid_new_gate = nn.Linear(hidden_size, hidden_size)

def gru_forward(self, inp, last_hid):
    r_t = torch.sigmoid(self.input_reset(inp) + self.hid_reset(last_hid))    # reset gate
    z_t = torch.sigmoid(self.input_update(inp) + self.hid_update(last_hid))  # update gate
    n_t = torch.tanh(self.input_new_gate(inp) + r_t * self.hid_new_gate(last_hid))  # candidate state
    h_t = (1 - z_t) * n_t + z_t * last_hid  # blend candidate with previous hidden state
    return h_t
```
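(The `GRUCell` wrapper class and the sizes below are just my own additions to check that the snippet runs; they aren't part of the code above.)

```python
class GRUCell(nn.Module):
    # Hypothetical wrapper so the two free functions above are usable as a module
    def __init__(self, input_size, hidden_size):
        super().__init__()
        initialize_gru(self, input_size, hidden_size)

    forward = gru_forward

cell = GRUCell(input_size=8, hidden_size=16)
x = torch.randn(4, 8)    # batch of 4 inputs
h = torch.zeros(4, 16)   # initial hidden state
h_next = cell(x, h)
print(h_next.shape)      # torch.Size([4, 16])
```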
In the function `gru_forward` there are 2 sigmoids and 1 tanh (sigmoid, sigmoid, tanh, in that order). I was experimenting with these activations and found that if I replace the sigmoids with tanh in both places (so all three are tanh), the network doesn't learn (the loss becomes NaN). The same happens if I replace the sigmoids with ReLU (relu, relu, tanh). If I put sigmoids in all three places, the network again fails to learn. But if I replace the tanh with ReLU (sigmoid, sigmoid, relu), learning is even faster than the default.
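To be concrete, here is roughly how I swapped the activations. The `gate_act` / `cand_act` arguments are just illustrative names I'm introducing here, not part of the code above:

```python
def gru_forward_custom(self, inp, last_hid,
                       gate_act=torch.sigmoid, cand_act=torch.tanh):
    # gate_act replaces the two sigmoids, cand_act replaces the tanh
    r_t = gate_act(self.input_reset(inp) + self.hid_reset(last_hid))
    z_t = gate_act(self.input_update(inp) + self.hid_update(last_hid))
    n_t = cand_act(self.input_new_gate(inp) + r_t * self.hid_new_gate(last_hid))
    return (1 - z_t) * n_t + z_t * last_hid

# The variants I tried:
# gate_act=torch.tanh,    cand_act=torch.tanh     -> loss becomes NaN
# gate_act=torch.relu,    cand_act=torch.tanh     -> loss becomes NaN
# gate_act=torch.sigmoid, cand_act=torch.sigmoid  -> doesn't learn
# gate_act=torch.sigmoid, cand_act=torch.relu     -> learns faster than default
```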
My question is: how can I know in advance which of these combinations will work? Is there any intuition behind this?