When to initialize LSTM hidden state?

My questions might be too dumb for advanced users; sorry in advance.

  1. In example tutorials like word_language_model or time_sequence_prediction, the states of the LSTM/RNN are initialized at each epoch:
    hidden = model.init_hidden(args.batch_size)

I tried to remove these in my code and it still worked the same. So, when do we actually need to initialize the states of lstm/rnn?
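One way to see why removing it changed nothing: recent versions of `nn.LSTM` default the initial state to zeros when you don't pass one, so explicitly passing zeros is equivalent. A minimal sketch (module sizes are made up for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=8, num_layers=2)
x = torch.randn(5, 3, 4)  # (seq_len, batch, input_size)

# Explicit zero initial states...
h0 = torch.zeros(2, 3, 8)  # (num_layers, batch, hidden_size)
c0 = torch.zeros(2, 3, 8)
out_explicit, _ = lstm(x, (h0, c0))

# ...give the same output as passing no initial state at all,
# because the default is zeros.
out_default, _ = lstm(x)

print(torch.allclose(out_explicit, out_default))  # True
```

So the explicit call only matters if you want something other than zeros (e.g. carrying state over from the previous batch).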

  1. Let's say I want to use different batch sizes at train, validation, and test time. I want a large batch size during training to speed up learning, and a small batch size at test time since the number of validation samples is small. Is this okay to do, or should I fix the batch size across training and validation? I believed it should be okay, but I got worse results in validation when I changed the batch size going from training to validation.


  1. Because otherwise it will reinitialize the hidden state to zeros (which means it does not remember across time steps). It works, but performance won't be as good.

  2. Yes, this is okay to do, but check whether you have to adjust the learning rate.


I am somewhat confused over here.
Shouldn’t RNNs retain the weights that were learnt, just like the linear layers do?
The language model example initializes the weights even while evaluating the learnt model.

Yes, the RNNs retain the learnt weights. But RNNs also have a hidden state that is not learnt; it is transferred from timestep to timestep.


Thanks for the response. That's exactly what I thought initially, but then this snippet got me confused:

    def init_hidden(self, bsz):
        weight = next(self.parameters()).data
        if self.rnn_type == 'LSTM':
            return (Variable(weight.new(self.nlayers, bsz, self.nhid).zero_()),
                    Variable(weight.new(self.nlayers, bsz, self.nhid).zero_()))
        return Variable(weight.new(self.nlayers, bsz, self.nhid).zero_())

From what I understand, self.parameters() returns all the model parameters, and .data then gets the underlying tensor of the first one.
But the part Variable(weight.new(self.nlayers, bsz, self.nhid).zero_()) seems to set all weights to zeros.

Also, since the code is iterating over parameters, shouldn't it be setting parameters one layer at a time? But here it seems that all model parameters are set to zero at the same time (self.nlayers, bsz, self.nhid).

I am sure I am missing some major link over here.

No, it creates a new Tensor with the same type as weight. It doesn't actually touch weight itself.
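This is easy to verify: `tensor.new(...)` allocates a fresh tensor sharing only the dtype and device of the source tensor, leaving the source untouched. A small sketch (sizes are arbitrary; modern code would use `new_zeros` instead of the older `.new(...).zero_()` idiom):

```python
import torch

weight = torch.randn(3, 3, dtype=torch.float64)
before = weight.clone()

# .new(...) allocates an uninitialized tensor with the same dtype/device
# as `weight`; .zero_() then fills it with zeros in place.
hidden = weight.new(2, 4, 5).zero_()

print(hidden.dtype)                 # torch.float64, inherited from weight
print(torch.equal(weight, before))  # True: weight itself was not modified
```

So in init_hidden, `weight` is only consulted for its type; the returned zero tensors are brand-new state, not model parameters.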


Why is that step needed, and how does it help?
Isn’t this step supposed to change the hidden states?

And there is one more confusion.
The main function calls init_hidden() as

hidden = model.init_hidden(eval_batch_size)

Now, going by the definition of init_hidden, it creates variables of the same type as weight for all parameters associated with the model.
But in the main function, init_hidden is used to update only hidden states. Shouldn't this create a size mismatch?

Apologies for all the questions, but I am quite new to pytorch and am probably missing something very basic.

This function init_hidden() doesn't initialize weights; it creates new initial states for new sequences. All RNNs need an initial state to calculate the hidden state at time t=1. You can check the size of this hidden variable to confirm this.
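Checking the size as suggested makes the distinction concrete: the state is shaped per batch of sequences, not like any weight matrix. A quick sketch (layer sizes chosen arbitrarily):

```python
import torch
import torch.nn as nn

num_layers, batch, nhid = 2, 3, 8
lstm = nn.LSTM(input_size=4, hidden_size=nhid, num_layers=num_layers)

h0 = torch.zeros(num_layers, batch, nhid)
c0 = torch.zeros(num_layers, batch, nhid)
x = torch.randn(5, batch, 4)  # (seq_len, batch, input_size)
out, (hn, cn) = lstm(x, (h0, c0))

# The state is (num_layers, batch, hidden_size): per-sequence activations,
# not weights. The learnt weights live in lstm.parameters().
print(hn.shape)  # torch.Size([2, 3, 8])
print(cn.shape)  # torch.Size([2, 3, 8])
```

Note how the batch dimension appears in the state shape; that is why init_hidden takes bsz as an argument.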


Did you mean to say that the hidden states are not retained from one sequence to another if it's not explicitly handled?
How can it not retain the hidden states between timesteps within a sequence? Isn't that pretty essential for an RNN to work in any setting?


The init_hidden function sets the hidden variables to zeros. So I don’t understand your first point.


I think you need to look at the documentation example for torch.nn.LSTM to get a better explanation.

I don’t understand the previous comment. What about it should I be looking at?

I think the line

weight = next(self.parameters()).data

is only used to get the type of the parameters; a tensor of the same type, filled with zeros, is then returned to initialize the hidden state to zeros (for each epoch).

I think the code for initializing the hidden states shown above does not make sense, because this is exactly what the LSTM does anyway if you only pass the inputs. In that case, the initial hidden states and cells are set to all zeroes for all sequences in the batch.

The only time one would need code like this is if one wants to do something different to initializing the hidden states to zeros for each sequence, e.g. initialize randomly, try to learn initial hidden states, use the last states from the previous sequence etc.

So having this code in a tutorial code is rather odd, in my opinion.

Or am I mistaken here?


I agree with you. According to the documentation, "If (h_0, c_0) is not provided, both h_0 and c_0 default to zero": https://pytorch.org/docs/stable/nn.html#lstm
So the hidden state and the cell state reset to zero for every epoch regardless; you don't have to pass any initial states unless you are initializing them to something else.

In my opinion, it only makes sense to have this in the LSTMCell case, since there we initialize the hidden and cell states with the previous step's hidden and cell states ourselves.
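To illustrate that point: with LSTMCell the state management is entirely manual, so you must create the initial zeros yourself and feed each step's output state back in. A minimal sketch (sizes made up):

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=4, hidden_size=8)
x = torch.randn(5, 3, 4)  # (seq_len, batch, input_size)

# Unlike nn.LSTM, LSTMCell has no default initial state handling across
# a sequence: we create the zeros once, then chain the state by hand.
h = torch.zeros(3, 8)
c = torch.zeros(3, 8)
for t in range(x.size(0)):
    h, c = cell(x[t], (h, c))

print(h.shape)  # torch.Size([3, 8]): final hidden state of the sequence
```

Here something like init_hidden earns its keep, since the caller owns the state tensors explicitly.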


I’m also very confused by this init_hidden method. I am trying to implement something similar so I can set the hidden state of a sequence model. A lot of the discussion here as well as the code for init_hidden uses the words ‘parameters’ and ‘weights’. The hidden states of an RNN are not parameters or weights. Can someone clear up the terminology and then confirm what init_hidden does? Shouldn’t the hidden state of an RNN be a buffer, not a parameter?

I completely disagree with point 1; it is misleading.
It's an optional argument and has nothing to do with remembering between the time steps.

Hope this link helps: Initialization of first hidden state in LSTM and truncated BPTT.

I guess the difference is that in the tutorial they initialize the current mini-batch's hidden state with the previous mini-batch's hidden state, which is why we need to detach it; otherwise we would unroll the RNN back to the first data point, as mentioned. If we initialize each batch's hidden state to zeros, we don't need the detaching.
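That carry-over-and-detach pattern can be sketched like this (batch shapes are arbitrary; this is the truncated-BPTT idiom, not the tutorial's exact code):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=8)
hidden = None  # first batch: nn.LSTM falls back to zero initial state

for _ in range(3):  # pretend these are consecutive mini-batches
    x = torch.randn(5, 2, 4)
    out, hidden = lstm(x, hidden)
    # Detach so backprop stops at the batch boundary instead of
    # unrolling through every previous mini-batch.
    hidden = tuple(h.detach() for h in hidden)

print(hidden[0].requires_grad)  # False: graph history is cut here
```

With zero re-initialization per batch, the `hidden` variable would simply be rebuilt each iteration and no detach would be needed.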

Hidden states and weights are different things here. If you go through the LSTM equations, you get roughly:

    gates(t)  = f1(W1 * input(t) + b1 + W2 * hidden(t-1) + b2)   # update/output gates etc.
    cell(t)   = f2(cell(t-1), gates(t))
    hidden(t) = f3(cell(t), gates(t))

Say my first sequence is Seq1 = tok11 + tok12 + tok13 and my second is Seq2 = tok21 + tok22 + tok23.
When I process tok11 or tok21, hidden(t-1) does not exist yet, so it is initialized to zero, and this has nothing to do with the weights or biases. When I process tok12, the hidden output of the previous token tok11 becomes my hidden(t-1), which is already calculated. I cannot pass the hidden output of tok13 (the last token of Seq1) to tok21, the first token of Seq2; that is why the hidden state is set to zero for every new sequence.
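The per-sequence reset described above can be sketched with LSTMCell, where the state handling is explicit (token and hidden sizes are made up; `run_sequence` is a hypothetical helper):

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=4, hidden_size=8)

def run_sequence(seq):
    # Fresh zero state for every new sequence: the first token's
    # hidden(t-1) is all zeros, regardless of earlier sequences.
    h = torch.zeros(1, 8)
    c = torch.zeros(1, 8)
    for tok in seq:  # tok11, tok12, tok13, ...
        h, c = cell(tok, (h, c))
    return h

seq1 = torch.randn(3, 1, 4)  # three tokens, batch of 1
seq2 = torch.randn(3, 1, 4)
h1 = run_sequence(seq1)  # state from Seq1 is NOT carried into Seq2
h2 = run_sequence(seq2)
print(h1.shape, h2.shape)  # torch.Size([1, 8]) torch.Size([1, 8])
```

Within a sequence the state flows token to token; between sequences it is rebuilt from zeros, exactly as the equations suggest.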