[Solved] Training a simple RNN

Thanks for the great example! I am now able to train the network. But I still have a question about detach() and retain_graph:

The current prediction should depend on all previous timesteps. If I don't detach the hidden state and train the network with backward(retain_graph=True), won't the parameters be updated to better values, since gradient descent can use the history from previous timesteps? In my tests the results are quite similar but still vary a little.
Just trying to clear up a few things in my head. Is the answer to this question different if I use a simple RNN, given that it differs a lot from an LSTM?

I could probably research these questions myself, but I just don't have enough time at the moment, and it bothers me to use something I don't understand.

Thanks in advance!

If you don't detach the hidden states, then each backward will trace all the way back to the beginning of that sequence and update parameter gradients along the way. The gradients will be more accurate (the correct gradient wrt the full sequence) than with a backward at each time step. In that case, I suggest you feed the sequence data directly to the nn.LSTM module, as it does exactly the same thing but much faster.

That said, I'm still not sure why you want to call backward at every single time step. It may save you some memory, but it takes the same amount of time, if not longer. And it gives incorrect gradients.

In the context of this question, RNN and LSTM are mostly the same. Both are backed by cuDNN and support operating directly on sequence data.
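For reference, here is a minimal sketch of what feeding the whole sequence to nn.LSTM and calling backward once could look like. The tensor sizes, the `head` layer and the MSE loss are assumptions for illustration only; they are not taken from the original poster's network.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumed, not from the thread)
seq_len, batch_size, input_size, hidden_size = 5, 3, 4, 8

lstm = nn.LSTM(input_size, hidden_size)  # expects input of shape (seq_len, batch, input_size)
head = nn.Linear(hidden_size, 1)         # hypothetical per-step prediction layer

x = torch.randn(seq_len, batch_size, input_size)
target = torch.randn(seq_len, batch_size, 1)

# One call processes every time step; `output` holds the hidden state of each step
output, (h_n, c_n) = lstm(x)
pred = head(output)                      # shape (seq_len, batch, 1)

loss = nn.functional.mse_loss(pred, target)
loss.backward()                          # a single backward through the full sequence
```

After this single backward, the LSTM weights carry gradients that take every time step into account, with no need for retain_graph.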


Thanks again for the quick reply!

The reason I don't feed the whole sequence to the network at once is that I want to try this out as “online training”, where I would receive one timestep value at a time. As the sequence of data I have to evaluate my model on is already complete, would I get the same results if I just passed the whole sequence instead of one timestep at a time (with retain_graph=True in the “one step at a time” version)? I still want the prediction for each timestep so I can evaluate how long it takes before the network makes accurate predictions.

In keras I did this by giving the LSTM the argument return_sequences=True. This is information I can probably easily find so no need to answer this if it would take you too much time.

And if it would work just as well to send in the whole sequence I would still have to go back to the model that performs one timestep at a time if I would try it out with online training. But I would rather do this if it is faster than using the “online version” for evaluating my model on complete sequences.

The gradients will be more accurate (correct gradient wrt full sequence)

That said, I'm still not sure why you want to call backward at every single time step. It may save you some memory, but it takes the same amount of time, if not longer. And it gives incorrect gradients.

I don't quite get this part. In the second quote, do you mean when using detach() instead of retain_graph=True? Because then I understand!

Thanks again for all the help!

Yes, it would be the same if you don't detach and use retain_graph.

In terms of the incorrect gradients, I was indeed talking about the detach option. Sorry for the confusion.

Are your training data very, very long sequences? If they are multiple sequences, I'd still try backward on the full sequence and see which is faster, because cuDNN is very fast. You can still evaluate every few sequences.
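To make the equivalence concrete, here is a small sketch that compares the two forward styles on the same data: the whole sequence at once, and one step at a time while carrying the hidden state along without detaching it. The sizes and the summed output used as a stand-in loss are assumptions. Note that it calls backward once at the end in both cases and does not cover updating the weights after every step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumed)
seq_len, batch_size, input_size, hidden_size = 6, 2, 3, 5
lstm = nn.LSTM(input_size, hidden_size)
x = torch.randn(seq_len, batch_size, input_size)

# (a) Whole sequence in one call
out_full, _ = lstm(x)
out_full.sum().backward()
grad_full = lstm.weight_hh_l0.grad.clone()

# (b) One time step per call, hidden state passed on WITHOUT detach
lstm.zero_grad()
state = None                      # None means zero initial hidden and cell state
outs = []
for t in range(seq_len):
    out_t, state = lstm(x[t:t + 1], state)
    outs.append(out_t)
out_steps = torch.cat(outs, dim=0)
out_steps.sum().backward()
grad_steps = lstm.weight_hh_l0.grad.clone()

# Both variants should agree up to floating point error
print(torch.allclose(out_full, out_steps, atol=1e-5))
print(torch.allclose(grad_full, grad_steps, atol=1e-4))
```

Variant (b) still gives a prediction for every time step, so it can be used to check how many steps the network needs before its predictions become accurate.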

I started this thread not understanding at all how RNNs work in pytorch, now everything is clear to me. Thank you a lot!

The sequences are pretty short but I still want to do this the right way and send in the whole sequence instead as it makes for cleaner testing and evaluating.

I'm going to mark the thread as solved if I figure out how to.

Hey Adrian,

Would it be possible to see the version of your net after you’ve applied this detach trick?
After reading this topic I’m still not sure how to use it properly…

It has been a while since I worked on this, so my memory isn't too fresh.

From what I understood detach works like this:

# In this example I have 2 different nets. The input is fed to the first net.
# The output of the first net is fed to the second net.
# I only want to train the second net in this example.

net1_input = get_data()
net1_output = net1(net1_input)
# This creates a computation graph from input to output (i.e. all operations performed on the input to obtain the output). The graph is used to compute gradients for the weights we want to improve.
# As we only want to train the second net, we don't need the computation graph of the first net, so we detach its output from that graph.
net2_input = net1_output.detach()
net2_output = net2(net2_input)
target = get_target()
loss = loss_function(net2_output, target)
# loss.backward() only computes gradients for the second net, because we detached the output of the first net before feeding it to the second net.
loss.backward()
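A self-contained version of the snippet above might look like the following. The two nn.Linear layers, the random data and the MSE loss are placeholders assumed for illustration; they stand in for net1, net2, get_data(), get_target() and loss_function.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder networks and data (assumed for illustration)
net1 = nn.Linear(4, 6)
net2 = nn.Linear(6, 2)
net1_input = torch.randn(8, 4)
target = torch.randn(8, 2)

net1_output = net1(net1_input)

# detach() returns a tensor that shares the data but is cut off from net1's graph
net2_input = net1_output.detach()
net2_output = net2(net2_input)

loss = nn.functional.mse_loss(net2_output, target)
loss.backward()

# Only net2 received gradients; net1 was never reached by backward
print(net1.weight.grad is None)
print(net2.weight.grad is not None)
```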

This is what detach does. When working with LSTMs it can get a bit confusing to have to handle the hidden and cell states yourself. Let's use the example from my first post in this thread:

I feed the network one time sample at a time. After each time sample I update the weights of the network. If I don't detach the hidden state after each timestep, the gradients of the weights will take all previous timesteps into account. If I detach the hidden state after each timestep, the gradients of the weights will only consider the last timestep.
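As a rough sketch of the detached variant described here: one time step per iteration, a weight update after each step, and the hidden state detached before the next step. The sizes, the random `stream` and `targets`, the `head` layer and the SGD settings are all assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumed)
input_size, hidden_size, seq_len = 3, 5, 10

lstm = nn.LSTM(input_size, hidden_size)
head = nn.Linear(hidden_size, 1)  # hypothetical prediction layer
params = list(lstm.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)

# Stand-in for values that arrive one time step at a time
stream = torch.randn(seq_len, 1, input_size)
targets = torch.randn(seq_len, 1, 1)

state = None   # None means zero initial hidden and cell state
losses = []
for t in range(seq_len):
    x_t = stream[t:t + 1]                      # shape (1, batch=1, input_size)
    out, state = lstm(x_t, state)
    loss = nn.functional.mse_loss(head(out), targets[t:t + 1])

    optimizer.zero_grad()
    loss.backward()                            # reaches back only through this step
    optimizer.step()

    # Keep the values of (h, c) but cut the link to the graph of earlier steps
    state = tuple(s.detach() for s in state)
    losses.append(loss.item())
```

Because the state is detached, each backward only sees the current step, so no retain_graph is needed and memory use stays constant over time.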

Read up on RNNs to understand what this means.

I didn't really understand whether it was detach itself you were unsure about or how to use it with RNNs. Hopefully this helps, but if you still don't understand I will try to give you more and clearer examples. I'm not always super quick to respond, it might take a few days. Good luck!


Thank you Adrian!
I think I’ve got the idea now.


I have a related problem if you don’t mind…

I want to process a sequence (tokens) with LSTM by hand, meaning that I’m going through the sequence with a for-loop instead of giving the whole sequence to LSTM and let it process all-at-once (let’s say).
The reason for doing this is not important (but I can tell it if you are curious).
Also, I want to use character features for each token.

So, in my network I have a first LSTM (charLSTM let’s say) computing character-level representations for tokens.
Such representations are the hidden layer of the charLSTM once it has processed the whole token.
I save character-level representations in a Variable which looks like:

char_rep = autograd.Variable( torch.zeros(sequence_length, batch_size, character_features) )

I fill this variable in a for-loop which looks like:

for i in range(sequence_length):
     char_features = init_char_features()
     lstm_out, char_features = charLSTM(char_input, char_features)     # char_input goes through the whole token
     char_rep[i,:,:] = char_features

So now I have character features, I can also compute token features as embeddings, and I can process the sequence using both kinds of features.
These two features are given as input to a second LSTM. The hidden state of this second LSTM is then used to compute the final output of the network.
So, I save the hidden state of the LSTM in a similar way as I do for the character-level representations:

hidden_state = autograd.Variable( torch.zeros(sequence_length, batch_size, hidden_features) )

And I fill this variable in a similar way:

for i in range(sequence_length):
     lstm_input = torch.cat( [ word_embeddings[i,:,:], char_rep[i,:,:] ] )
     lstm_out, hidden_features = tokenLSTM(lstm_input, hidden_features)     # process one sequence position, but for the whole batch
     hidden_state[i,:,:] = hidden_features

After the forward step of the network, I call backward on the whole sequence, actually on the whole batch of sequences. I process batch_size sequences at the same time.

I have actually 2 questions:

  1. Should I also detach some hidden state at some point?
  2. Does filling variables the way I do, e.g. char_rep or hidden_state, break the computation graph?
    I mean, I know there are issues with in-place Variable operations, that's why I'm asking this question.

My guess for the 1st question is no. However, I'm seeing huge memory utilisation, much more than I expected and also more than I (roughly) computed.

My guess for the 2nd question is no. However results are not convincing to me. I’m trying to replicate networks I already coded in the past with other frameworks. I would like to move to PyTorch because I think that would be much faster. But actually at the moment I’m not faster, and results are actually much worse than what I got with the other framework. However the latter may be due to my own mistakes in coding the network.

Any answer would be appreciated.
Thank you in advance.

A couple questions before I can answer your concerns.

  1. Are you reusing the same charLSTM in both places?
  2. Is charLSTM an RNN class or an RNNCell class?
  3. The torch.cat( [ word_embeddings[i,:,:], char_rep[i,:,:] ] ) call will concat along dim 0. Depending on the class of your charLSTM, this has different effects. Are you sure that this is what you want?

Ah, sorry:

  1. No, I did copy&paste. They are 2 different LSTMs
  2. LSTMs are both LSTMs (nn.LSTM)
  3. I left out some details: torch.cat is performed “correctly”, so that the concatenated number of features matches the expected input size of the second LSTM. So basically,
torch.cat( [ word_embeddings[i,:,:], char_rep[i,:,:] ] )

is performed so that the result is 1 x batch_size x (<word_embedding-dim.> + <char_features-dim.>)

Does that make sense?
Thank you

Sorry, may I ask what version you are using? I’m fairly certain that torch.cat concatenates along dim 0 by default as I just checked doc for master, 0.3 and 0.2, and tested on master.

Sorry, I should have explained more clearly.
I actually do not concatenate along dimension 0, I concatenate along dimension 2.
I actually perform:

torch.cat( [word_embeds[i,:,:].view(1, batch_size, -1), char_rep[i,:,:].view(1, batch_size, -1)], 2 )

because all concatenated Variables are actually 3-dimensional, and I want to concatenate along the features dimension (the other two are sequence_length and batch_size).
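The difference between the default dim 0 and the feature dimension can be checked with a few lines. The sizes below are made up for illustration, and `word_t` and `char_t` are hypothetical stand-ins for word_embeds[i,:,:] and char_rep[i,:,:].

```python
import torch

# Illustrative sizes (assumed)
batch_size, word_dim, char_dim = 4, 10, 6

word_t = torch.randn(batch_size, word_dim)  # stand-in for word_embeds[i, :, :]
char_t = torch.randn(batch_size, char_dim)  # stand-in for char_rep[i, :, :]

# Without a dim argument, torch.cat joins along dim 0 (here the batch dimension)
along_dim0 = torch.cat([word_t, word_t])
print(along_dim0.shape)  # (2 * batch_size, word_dim)

# Reshape to (1, batch, features) and join along the feature dimension (dim 2)
lstm_input = torch.cat(
    [word_t.view(1, batch_size, -1), char_t.view(1, batch_size, -1)], 2
)
print(lstm_input.shape)  # (1, batch_size, word_dim + char_dim)
```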

I see. Thanks for the additional information. What you do seems correct. Although it is not the most efficient speed-wise, the performance should not degenerate. So I’m curious if you can provide the full script. A reference script in tf/keras/theano/etc would be helpful as well.

For your second question, it doesn’t break autograd because you don’t need the overwritten values to compute any gradients. Same reasoning applies to in-place relu etc.
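A quick way to convince yourself of this is to fill a pre-allocated tensor in a loop and check that gradients still reach the LSTM weights. This is a sketch with assumed sizes, written against current PyTorch where plain tensors replace autograd.Variable. It also collects the same values with torch.stack, which avoids the in-place writes altogether.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumed)
seq_len, batch_size, input_size, hidden_size = 4, 2, 3, 5
lstm = nn.LSTM(input_size, hidden_size)
x = torch.randn(seq_len, batch_size, input_size)

# Pre-allocated buffer that is filled in place, as in the question
hidden_state = torch.zeros(seq_len, batch_size, hidden_size)

state = None
collected = []
for i in range(seq_len):
    out, state = lstm(x[i:i + 1], state)
    h = state[0][0]                 # hidden state of the single layer, shape (batch, hidden)
    hidden_state[i, :, :] = h       # in-place write, recorded by autograd
    collected.append(h)

# Alternative without in-place writes
stacked = torch.stack(collected)

hidden_state.sum().backward()

print(hidden_state.grad_fn is not None)      # the buffer is part of the graph
print(lstm.weight_ih_l0.grad is not None)    # gradients reached the LSTM weights
```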


OK, thank you for taking the time to understand and answer this.

I will clean up the code and put it here.
In the meanwhile you can take a look at the bottom of this discussion:

The script there is a bit different and you have the reason why I want to process sequences by hand: I want to re-inject predicted labels as input to the network, and embed them just like words.
So together with word embeddings and character features, the input to the hidden layer contains also label embeddings.

I posted there a few days ago but got no answer, I guess because the discussion is a bit old…
Hope it will be clear enough, otherwise I will come back soon with the cleaned script.

Thank you in advance in any case

Hi again,

I haven't finished yet (sorry, I also have other things to do for my job), but while cleaning the code I thought of this:

char_rep and hidden_state Variables, which are used for keeping character-level representations and hidden states that are used later to compute the network output, are local variables of the forward method of the network.

So, after the forward call they normally go out of scope, don't they?
If this is right, I don't know why I'm not getting any error, but it could explain the poor results: back-propagation would not actually be updating all the weights down to the character embeddings.
This may not solve my memory problem, but it could be the explanation for the results.

See you soon

You are right in that they go out of scope. However, there are still references to them from the computation graph, which is not explicitly shown. So they are not deallocated/gc’ed.

By the way, how did you init char_input?

Regarding your memory issue, is the memory usage increasing every iteration? Or is it just that the script takes a large, constant amount of memory?

If it is the latter, it could be related to how you store the char_hidden as a class attribute, self.char_hidden, retaining references to the graph. But it shouldn’t matter if there are backward calls that reach this part of the graph, as backward frees the graph…

Sorry, I didn't look into the details of the code in the post you linked. Knowing the answer to my question above should help us determine where to look to find the root cause.

Btw, the conversation is getting quite long. Feel free to start a new thread so we don’t disturb others with notifications.

OK, thanks.

I have answered your questions in a new thread:

I was facing the same problem myself. Took me a while to find this thread but it is all clear now. Thanks!