Hidden state initialization for RNNs

When using a standard nn.GRU module, how should one initialize the hidden state? Currently, I do the following:

def init_hidden(self, batch_size):
    # tensor of size [num_layers * num_directions, batch_size, hidden_size]
    return Variable(torch.zeros(self.num_layers * self.num_directions, batch_size, self.hidden_size)).cuda()

This works - however, does this assume the network learns a different hidden state for each element in the batch? Should I somehow be replicating the same initial hidden state for every element in the batch, so that the network doesn’t need to learn batch_sz hidden states, but rather one hidden state?


I don’t think you can declare the hidden state without the batch dimension, given that the documentation specifies:

h_0 (num_layers * num_directions, batch, hidden_size): tensor containing the initial hidden state for each element in the batch.

This implies a hidden state of dimensions [1, batch_sz, hidden_size] for a single-layer uni-directional RNN.
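A minimal sketch of that shape, with hypothetical sizes chosen only for illustration:

```python
import torch
import torch.nn as nn

# Single-layer, uni-directional GRU (sizes are arbitrary examples)
gru = nn.GRU(input_size=10, hidden_size=20, num_layers=1)

seq_len, batch_size = 5, 3
x = torch.randn(seq_len, batch_size, 10)  # (seq_len, batch, input_size)

# h_0: [num_layers * num_directions, batch, hidden_size] -> [1, 3, 20]
h0 = torch.zeros(1, batch_size, 20)

out, h_n = gru(x, h0)
print(out.shape)  # torch.Size([5, 3, 20])
print(h_n.shape)  # torch.Size([1, 3, 20])
```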

Right, my mistake. Sorry, I missed that.

I’m a bit confused. Why do you want the network to output the same hidden state for each element in the batch? If so, there will be no correlation along the time dimension, and you would essentially be predicting the current timestep’s output from the current timestep’s input only. If that is something you want in your use case, maybe try non-recurrent models? 🙂

Thanks Simon. The question is more around learning hidden states. Currently I assign zero initial states, and I actually don’t make them volatile, so they do have a grad_fn. However, I don’t see the point of that, since I feed a new initial state to every sequence I get (before training the network on a new batch of sequences, I call init_hidden() above).

This pdf (by Hinton) mentions that it’s better to actually learn the initial hidden state of the network, which got me thinking about assigning one hidden state to the network and essentially replicating it across each element in the batch, so the network can learn that initial state.
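For concreteness, a minimal sketch of that idea (the class name and sizes are hypothetical): one learnable h_0 stored as an nn.Parameter, replicated across the batch dimension at forward time, so gradients flow back into a single shared initial state.

```python
import torch
import torch.nn as nn

class GRUWithLearnedInit(nn.Module):
    """Sketch: one learnable initial hidden state, shared by every sequence."""
    def __init__(self, input_size, hidden_size, num_layers=1):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, num_layers)
        # Single learnable h_0: [num_layers, 1, hidden_size]
        self.h0 = nn.Parameter(torch.zeros(num_layers, 1, hidden_size))

    def forward(self, x):
        batch_size = x.size(1)
        # expand replicates the same parameter across the batch dimension
        h0 = self.h0.expand(-1, batch_size, -1).contiguous()
        return self.gru(x, h0)

model = GRUWithLearnedInit(10, 20)
out, h_n = model(torch.randn(5, 3, 10))
print(out.shape)  # torch.Size([5, 3, 20])
```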

What do you think?

I didn’t read the slides in detail, but I still don’t understand why initializing your hidden state to something non-zero (independent of input) would be very different. Within the sequence, the hidden states are still passed through, which is the more important thing.

Fair enough - agreed. My question wasn’t around what to initialize the hidden state to, whether zeros or 0.5, but rather whether it’s customary to initialize the hidden state before each sequence like I do above, or whether some people initialize the hidden state once during training and keep evolving it as the network sees more sequences (i.e., the init_hidden() function above would only be called once in the beginning, not before each sequence).


You should do it for each training sequence 🙂. The logic is that you probably don’t want to introduce inter-data dependencies during training. If that happens, then what you get from the model depends on what you fed into it before. That is probably something you don’t want, especially at test time.


Got it - thanks Simon!

As you said, why do we need to specify batch_size? Here we are talking about initializing the initial hidden state of the GRU model, so isn’t it supposed to be of shape [no_of_stacked_layer, hidden_size_of_gru]? Why do we need to include batch_size in the shape? I can’t get my head around this. Can anyone clarify?
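A minimal sketch (with hypothetical sizes) of why the batch dimension is needed: every sequence in the batch evolves its own hidden state independently, so h_0 and h_n must carry one state per layer *per batch element*.

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=10, hidden_size=20, num_layers=2)

batch_size = 4
x = torch.randn(7, batch_size, 10)  # 4 independent sequences

# One hidden state per layer per sequence in the batch:
h0 = torch.zeros(2, batch_size, 20)
out, h_n = gru(x, h0)

# Each batch element has evolved its own state:
print(h_n.shape)  # torch.Size([2, 4, 20])
```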