When to initialize LSTM hidden state?


My questions might be too dumb for advanced users, sorry in advance.

  1. In example tutorials such as word_language_model and time_sequence_prediction, the states of the LSTM/RNN are initialized at each epoch:
    hidden = model.init_hidden(args.batch_size)

I tried to remove these in my code and it still worked the same. So, when do we actually need to initialize the states of lstm/rnn?

  2. Let's say I want to use different batch sizes at training, validation and test time. I want a large batch size during training to speed up learning, and a small batch size at validation time because the number of validation samples is small. Is this an okay thing to do, or should I keep the batch size fixed across training and validation? I believed it should be okay, but I got worse validation results when I changed the batch size going from training to validation.


  1. If you don't pass a hidden state, the LSTM/RNN starts from a zero state on every call, which means nothing is remembered from one batch to the next. It works, but performance won't be as good.

  2. Yes, this is okay to do, but check whether you have to adjust the learning rate.
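To make both points concrete, here is a minimal sketch (not the tutorial's code). The layer sizes, batch sizes and the `zero_state` helper are assumptions made up for illustration. It checks that omitting the hidden state gives the same output as passing zeros explicitly, and that a different batch size only requires an initial state built for that batch size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=8, num_layers=2)  # arbitrary sizes

train_batch = torch.randn(5, 32, 4)  # (seq_len, batch, input_size)
eval_batch = torch.randn(5, 3, 4)    # much smaller batch at evaluation time

def zero_state(batch_size):
    # Hypothetical helper playing the role of init_hidden:
    # a zero (h_0, c_0) pair sized for the given batch.
    shape = (lstm.num_layers, batch_size, lstm.hidden_size)
    return (torch.zeros(shape), torch.zeros(shape))

with torch.no_grad():
    # Question 1: no hidden state passed vs. explicit zeros.
    out_default, _ = lstm(train_batch)
    out_zeros, _ = lstm(train_batch, zero_state(32))
    same_as_zeros = torch.allclose(out_default, out_zeros)

    # Question 2: same layer, different batch size, fresh state of matching size.
    out_eval, (h_eval, c_eval) = lstm(eval_batch, zero_state(3))

print(same_as_zeros, tuple(out_eval.shape), tuple(h_eval.shape))
```

So removing the `init_hidden` call only "works the same" when the hidden state is not being carried from batch to batch anyway. If it is carried, it has to be recreated whenever the batch size changes.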

(Parth Mehta) #3

I am somewhat confused over here.
Shouldn’t RNNs retain the weights that were learnt, just like the linear layers do?
The language model example initializes the weights even while evaluating the learnt model.


Yes, the RNNs retain the learned weights. But RNNs also have a hidden state that is not learned; it is passed along from timestep to timestep.
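A small sketch of the difference, assuming a toy `nn.LSTM` with made-up sizes: a forward pass leaves the weights untouched, while the hidden state is an ordinary return value that you either feed back in or drop. Within one call the state is carried across timesteps automatically; across calls it is carried only if you pass it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.LSTM(input_size=4, hidden_size=8)  # arbitrary sizes
seq = torch.randn(10, 2, 4)                 # (seq_len, batch, input_size)

weights_before = [p.detach().clone() for p in rnn.parameters()]

with torch.no_grad():
    full_out, _ = rnn(seq)                # whole sequence in one call
    first_out, hidden = rnn(seq[:5])      # first half, keep the returned state
    second_out, _ = rnn(seq[5:], hidden)  # second half, continuing from that state
    reset_out, _ = rnn(seq[5:])           # second half, starting again from zeros

# Forward passes never modify the learned parameters.
weights_unchanged = all(
    torch.equal(a, b) for a, b in zip(weights_before, rnn.parameters())
)
# Passing the state along reproduces the single-call result.
carried_matches_full = torch.allclose(
    torch.cat([first_out, second_out]), full_out, atol=1e-5
)
# Dropping the state gives a different result for the second half.
reset_differs = not torch.allclose(reset_out, full_out[5:], atol=1e-5)

print(weights_unchanged, carried_matches_full, reset_differs)
```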

(Parth Mehta) #5

Thanks for the response. That's exactly what I thought initially, but then this snippet got me confused:

    def init_hidden(self, bsz):
        # Borrow one parameter only to get its tensor type; it is not modified.
        weight = next(self.parameters()).data
        if self.rnn_type == 'LSTM':
            # An LSTM needs two states: (h_0, c_0).
            return (Variable(weight.new(self.nlayers, bsz, self.nhid).zero_()),
                    Variable(weight.new(self.nlayers, bsz, self.nhid).zero_()))
        else:
            return Variable(weight.new(self.nlayers, bsz, self.nhid).zero_())

From what I understand, self.parameters() returns all the model parameters, and then gets the data from it.
But the part Variable(weight.new(self.nlayers, bsz, self.nhid).zero_()) seems to set all weights to zeros.

Also, since the code is iterating over parameters, shouldn’t it be setting the parameters of one layer at a time? But here it seems that all model parameters are set to zero at the same time (self.nlayers, bsz, self.nhid).

I am sure I am missing some major link over here.


No, it creates a new tensor with the same type as weight. It doesn’t actually touch weight itself.
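This can be checked directly. The sketch below uses a throwaway `nn.LSTM` converted to double precision (an assumption, chosen only so the inherited dtype is visible) and compares the `weight.new(...).zero_()` idiom with `new_zeros`, which I believe is the more common spelling today.

```python
import torch
import torch.nn as nn

# Toy model in float64 so it is obvious where the dtype comes from.
model = nn.LSTM(input_size=4, hidden_size=8, num_layers=2).double()

weight = next(model.parameters()).data  # first parameter tensor
snapshot = weight.clone()               # copy, to prove it is not modified

# Idiom from the snippet: a new tensor of the same type, then zero it in place.
h0 = weight.new(2, 3, 8).zero_()

# Same result in one call.
h0_modern = weight.new_zeros(2, 3, 8)

print(h0.dtype, tuple(h0.shape), torch.equal(weight, snapshot))
```

The new tensor has the shape you ask for (layers, batch, hidden size), not the shape of `weight`, and `weight` keeps its values.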

(Parth Mehta) #7

Why is that step needed, and how does it help?
Isn’t this step supposed to change the hidden states?

And there is one more confusion.
The main function calls init_hidden() as

    hidden = model.init_hidden(eval_batch_size)

Now going by definition of init_hidden, it creates variables of type weight for all parameters associated with the model.
But in the main function init_hidden is used to update only hidden states. Shouldn’t this create size mismatch?

Apologies for all the questions, but I am quite new to pytorch and am probably missing something very basic.

(Daria Vazhenina) #8

The function init_hidden() doesn’t initialize weights; it creates new initial states for new sequences. Every RNN needs an initial state in order to compute the hidden state at time t=1. You can check the size of this hidden variable to confirm this.
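Here is one way to do that size check, as a rough sketch. `TinyRNNModel` is a hypothetical stand-in for the tutorial model (LSTM only, sizes invented), with `init_hidden` written using `new_zeros` instead of `Variable`.

```python
import torch
import torch.nn as nn

class TinyRNNModel(nn.Module):
    """Hypothetical stand-in for the tutorial model, LSTM only."""

    def __init__(self, ninp=4, nhid=8, nlayers=2):
        super().__init__()
        self.nhid, self.nlayers = nhid, nlayers
        self.rnn = nn.LSTM(ninp, nhid, nlayers)

    def init_hidden(self, bsz):
        # Use one parameter only as a template for dtype and device.
        weight = next(self.parameters())
        return (weight.new_zeros(self.nlayers, bsz, self.nhid),
                weight.new_zeros(self.nlayers, bsz, self.nhid))

model = TinyRNNModel()
hidden = model.init_hidden(10)  # 10 plays the role of eval_batch_size

state_shapes = [tuple(h.shape) for h in hidden]
param_shapes = [tuple(p.shape) for p in model.parameters()]

print("state shapes:", state_shapes)
print("parameter shapes:", param_shapes)
```

The state shapes depend on the batch size and the parameter shapes do not, so there is no size mismatch: the two are separate tensors.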

(Minesh Mathew) #9

Did you mean to say that the hidden states are not retained from one sequence to another unless that is handled explicitly?
How could it not retain the hidden states between timesteps within a sequence? Isn’t that pretty essential for an RNN to work in any setting?


The init_hidden function sets the hidden variables to zeros. So I don’t understand your first point.

(D) #11

I think you need to look at the example in the torch.nn.LSTM docs to get a better explanation.


I don’t understand the previous comment. What about it should I be looking at?

(Manikbhandari) #13

I think the line

    weight = next(self.parameters()).data

is only used to get the type of the parameters; a new tensor of the same type, filled with zeros, is then returned to initialize the hidden state to zeros (at each epoch).