Question on PyTorch Tutorials about RNN and LSTM

In the “Sequence Models and Long Short-Term Memory Networks” part of the PyTorch tutorials, there is code like this:

for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Also, we need to clear out the hidden state of the LSTM,
        # detaching it from its history on the last instance.
        model.hidden = model.init_hidden()

        # Step 2. Get our inputs ready for the network, that is, turn them into
        # Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)
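For context, the quoted snippet cuts off before the forward/backward steps. A runnable sketch of the full loop is below; the model and helper names (`LSTMTagger`, `prepare_sequence`) follow the tutorial, but the hyperparameters and the use of `CrossEntropyLoss` (equivalent to the tutorial's log-softmax + NLLLoss) are my own choices:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(1)

# Toy data, as in the tutorial.
training_data = [
    ("the dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("everybody read that book".split(), ["NN", "V", "DET", "NN"]),
]
word_to_ix = {}
for sent, _ in training_data:
    for word in sent:
        word_to_ix.setdefault(word, len(word_to_ix))
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

def prepare_sequence(seq, to_ix):
    return torch.tensor([to_ix[w] for w in seq], dtype=torch.long)

class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # Fresh (h, c) states, detached from any previous sequence's history.
        return (torch.zeros(1, 1, self.hidden_dim),
                torch.zeros(1, 1, self.hidden_dim))

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, self.hidden = self.lstm(
            embeds.view(len(sentence), 1, -1), self.hidden)
        tag_scores = self.hidden2tag(lstm_out.view(len(sentence), -1))
        return tag_scores

model = LSTMTagger(6, 6, len(word_to_ix), len(tag_to_ix))
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

for epoch in range(300):
    for sentence, tags in training_data:
        model.zero_grad()                    # Step 1: clear accumulated gradients
        model.hidden = model.init_hidden()   #         and reset the (h, c) states
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)
        tag_scores = model(sentence_in)      # Step 3: forward pass
        loss = loss_function(tag_scores, targets)
        loss.backward()                      # Step 4: backprop and weight update
        optimizer.step()
```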

My question is: every time you train on a new batch or a new sequence, you clear the hidden state, which seems to mean you just throw away the parameters you have trained. Then in the next epoch you would be repeating meaningless work, just like the epoch before, because your parameters are zero again. So I really can't understand why we use `model.hidden = model.init_hidden()` here.

Can anyone help me out? I have been thinking about this question for a whole day, although I know I'm wrong and it probably seems really stupid haha…

Thanks in advance!

Simply because the hidden state and cell state are not learned parameters.

I'm sorry, what do you mean by "learned parameters" here? In my understanding, we are trying to learn the hidden state and cell state, right?

No. Please take a look at the documentation to see the LSTM or GRU equations. What is learned are the weights. The hidden/cell states are calculated from the previous time step's hidden/cell state and the input at the current time step. So, when you run an LSTM over a sequence, you have to define the initial hidden/cell states.
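A minimal sketch of this distinction: the learned parameters of an `nn.LSTM` are its weight/bias tensors, while the initial `(h, c)` states are plain tensors we supply ourselves (the dimensions here are arbitrary, chosen just for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# The weights inside the LSTM are the learned parameters.
lstm = nn.LSTM(input_size=4, hidden_size=3)

# A sequence of 5 time steps, batch size 1, feature size 4.
seq = torch.randn(5, 1, 4)

# The initial hidden/cell states are NOT parameters -- we supply them,
# and each step's (h, c) is computed from the previous step and the input.
h0 = torch.zeros(1, 1, 3)
c0 = torch.zeros(1, 1, 3)

out, (hn, cn) = lstm(seq, (h0, c0))

# Only weight/bias tensors show up as learnable parameters, never h or c:
param_names = [name for name, _ in lstm.named_parameters()]
print(param_names)
```

So resetting `(h, c)` between sequences does not touch anything the optimizer has learned; the weights persist across epochs.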


Ohhh, you are right haha, I made such a stupid mistake. Yes, we are learning the parameters that produce the hidden state and cell state. But what's the difference if we don't call `model.zero_grad()` and `model.hidden = model.init_hidden()`? I gave it a try this evening and it gives me results like "NaN". Is it because the net becomes too large?

I think my new question should be expressed like this: since the hidden state and cell state are not learned parameters, why do we have to clear the gradients and re-initialize the hidden state before we train on the next batch or example? What does that line mean? I'm sorry, I used to use TF or just code with numpy, so I'm not so familiar with PyTorch. Thanks again!

Essentially, after you have calculated the gradients via `backward()` and updated the weights, you need to clear the gradients. Otherwise, PyTorch will keep accumulating gradients. You can find more details here, and you can check the following discussion: Why do we need to set the gradients manually to zero in pytorch?
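A tiny sketch of the accumulation behavior: each `backward()` call adds into `.grad` rather than overwriting it, so without clearing, the stored gradient keeps growing (the single-weight setup here is just for illustration):

```python
import torch

# One learnable weight; loss = w * 1.0, so d(loss)/dw = 1 per backward pass.
w = torch.ones(1, requires_grad=True)

for step in range(3):
    loss = w * 1.0
    loss.backward()
    print(w.grad.item())  # grows by 1 each pass: gradients accumulate

grads_without_clearing = w.grad.item()

# Clearing between steps gives the fresh per-step gradient instead:
w.grad = None
loss = w * 1.0
loss.backward()
grad_after_clearing = w.grad.item()
```

This is why a training loop calls `model.zero_grad()` (or `optimizer.zero_grad()`) once per batch: otherwise every update is driven by the sum of all past batches' gradients, which quickly blows up.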

Thank you so much!! I will check this out 🙂