[Solved] Training a simple RNN

Hey! PyTorch is amazing and I’m trying to learn how to use it at the moment. I am stuck trying to train a simple RNN to predict the next value in a time series with a single feature value per timestep.

Here is my net:

class Net(nn.Module):
    def __init__(self, hidden_dim=128, num_layers=1):
        super().__init__()  # must run before any submodule is assigned
        self.hidden_dim = hidden_dim
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_dim, num_layers=num_layers)
        self.dense = nn.Linear(hidden_dim, 1)
        self.hidden = self.init_hidden()
    def init_hidden(self):
        return (
    def forward(self,x):
        lstm_out, self.hidden = self.lstm(x,self.hidden)
        linear_out = self.dense(lstm_out)
        return linear_out

And here is my training loop:

opt = optim.Adam(net.parameters())
loss_fn = nn.MSELoss()
for i in range(len(values)-1):
    x = np.array(values[i],dtype=np.float32).reshape((1,1,1))
    x_var = Variable(torch.from_numpy(x)).cuda()
    y = np.array(values[i+1],dtype=np.float32).reshape((1,1,1))
    y_var = Variable(torch.from_numpy(y)).cuda()
    out = net.forward(x_var)
    loss = loss_fn(out,y_var)

I want to be able to do “online training” and thus only want to input 1 timestep at a time. I am getting the following error:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

I am not calling backward() twice as far as I can see. I must have misunderstood how RNNs work in PyTorch, as they were pretty much “plug and play” in Keras and you didn’t have to hold on to the hidden state.

Thanks in advance!


It seems that you are missing the following line in the loop (based on this tutorial):

# Also, we need to clear out the hidden state of the LSTM,
# detaching it from its history on the last instance.
model.hidden = model.init_hidden()

Once you call backward, the graph is traversed back through self.hidden and the saved context variables are freed. If you only want to backpropagate through one time step, store a detached version of the hidden variable instead.

The hidden state is initialized in the constructor of the net. I don’t want to zero it out before every timestep, as I want to keep temporal information in the hidden state of the LSTM. The difference is that in that tutorial they seem to pass in the whole sequence rather than one step at a time, and therefore have to zero the hidden state before each new sequence. I am only training on a single sequence. In my code “values” is the time series, which this time happens to have shape (1600,1). Resetting the hidden state at each timestep removes the error, but the net no longer works as I intended.

I want to call backward on each timestep to improve the prediction of the next timestep. I just don’t understand where in my code the first backward() is called, because when I call it explicitly (loss.backward()) it says that I have already called it once.

I also thought the hidden state returned by the LSTM contained the history of previous hidden states? I tried saving the hidden states in a list and passing the latest hidden state to the LSTM in forward(), but that ended with the same error.

It’s getting late where I live and I might not be able to reply today, but I will be back in the thread tomorrow at the latest (morning in NA).

Let me explain using an example. You have y_0, h_0 = f(x_0, 0) at time 0 and y_1, h_1 = f(x_1, h_0) at time 1. At time 0, when you call loss(y_0, real_y_0).backward(), it backtracks through the graph, including the computations that produced h_0, and then frees the buffers saved for them. At time 1, you call loss(y_1, real_y_1).backward(), which backtracks through both x_1 and h_0, since both are needed to compute y_1. This is the point where you backtrack through the part of the graph that computed h_0 a second time, after its buffers have been freed, which triggers the error.

The solution is to save hidden.detach() instead, so that the graph for the next step stops at the current hidden state.
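A minimal sketch of the detach fix for one-step-at-a-time training, assuming a recent PyTorch version on CPU. The toy sine series stands in for `values`, and the sizes and the `train_online` helper are hypothetical, not the original poster's code. It also shows that skipping the detach raises a RuntimeError on the second step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lstm = nn.LSTM(input_size=1, hidden_size=8)
dense = nn.Linear(8, 1)
opt = torch.optim.Adam(list(lstm.parameters()) + list(dense.parameters()))
loss_fn = nn.MSELoss()

# Hypothetical toy series standing in for `values`.
values = torch.sin(torch.linspace(0, 6.28, 50))

def train_online(detach):
    # (h_0, c_0), each of shape (num_layers, batch, hidden_size)
    hidden = (torch.zeros(1, 1, 8), torch.zeros(1, 1, 8))
    losses = []
    for i in range(len(values) - 1):
        x = values[i].reshape(1, 1, 1)      # (seq_len, batch, features)
        y = values[i + 1].reshape(1, 1, 1)
        out, hidden = lstm(x, hidden)
        loss = loss_fn(dense(out), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if detach:
            # Keep the values of the state but cut it off from the old graph.
            hidden = tuple(h.detach() for h in hidden)
        losses.append(loss.item())
    return losses

losses = train_online(detach=True)

# Without detaching, the second backward reaches the already freed graph.
try:
    train_online(detach=False)
    raised = False
except RuntimeError:
    raised = True
```

The hidden state still carries information forward in time through its values; detaching only limits how far back the gradient is propagated.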


Moreover, since you are essentially using it as an RNN cell rather than a complete RNN, you may want to look into LSTMCell. You may need to experiment to see which is faster, since LSTM is implemented in cuDNN but has overhead.
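As a rough illustration of the LSTMCell suggestion, here is a single step with nn.LSTMCell, assuming a batch size of 1 and a hypothetical hidden size of 8. Note that the cell takes 2-D inputs of shape (batch, features), unlike nn.LSTM.

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=1, hidden_size=8)
dense = nn.Linear(8, 1)

# The cell state and hidden state are both (batch, hidden_size).
h = torch.zeros(1, 8)
c = torch.zeros(1, 8)

x = torch.tensor([[0.5]])   # one timestep, shape (batch, input_size)
h, c = cell(x, (h, c))      # advance the state by one step
pred = dense(h)             # prediction for the next value

# Detach before the next step, as with nn.LSTM.
h, c = h.detach(), c.detach()
```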

Thanks for the great example! I am now able to train the network. But I still have a question about detach() and retain_graph:

The current prediction should depend on all previous timesteps. If I don’t detach the hidden state and train the network with backward(retain_graph=True), won’t the parameters be updated to better values, since gradient descent can use the history from previous timesteps? From my testing the results are quite similar but still vary a little.
Just trying to clear up a few things in my head. Is the answer to this question different if I use a simple RNN, since it differs a lot from an LSTM?

I can probably research these questions myself, but I just don’t have enough time at the moment and it bothers me to use something I don’t understand.

Thanks in advance!

If you don’t detach the hidden states, then each backward will trace all the way back to the beginning of the sequence and accumulate parameter gradients along the way. The gradients will be more accurate (the correct gradient w.r.t. the full sequence) than those from a backward restricted to each time step. In that case, I suggest you feed the sequence data directly to the nn.LSTM module, as it does exactly the same thing but much faster.
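A sketch of what feeding the whole sequence to nn.LSTM could look like, assuming the same kind of toy series as a stand-in for the real data (the sizes and variable names are hypothetical). One forward call produces an output for every timestep, and a single backward covers the full sequence.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lstm = nn.LSTM(input_size=1, hidden_size=8)
dense = nn.Linear(8, 1)
opt = torch.optim.Adam(list(lstm.parameters()) + list(dense.parameters()))
loss_fn = nn.MSELoss()

# Hypothetical toy series; inputs are steps 0..n-2, targets are steps 1..n-1.
values = torch.sin(torch.linspace(0, 6.28, 50))
inputs = values[:-1].reshape(-1, 1, 1)    # (seq_len, batch, features)
targets = values[1:].reshape(-1, 1, 1)

lstm_out, _ = lstm(inputs)    # initial state defaults to zeros
preds = dense(lstm_out)       # one prediction per timestep
loss = loss_fn(preds, targets)

opt.zero_grad()
loss.backward()               # gradient w.r.t. the full sequence
opt.step()
```

Because lstm_out holds the output at every timestep, per-timestep predictions remain available for evaluation.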

That said, I am still not sure why you want to call backward at every single time step. It may save you some memory, but it takes the same amount of time, if not longer. And it gives incorrect gradients.

In the context of this question, RNN and LSTM are mostly the same. Both are backed by cuDNN and support operating directly on sequence data.

Thanks again for the quick reply!

The reason I don’t feed the whole sequence to the network at once is that I want to try this out as “online training”, where I would receive one timestep value at a time. As the sequence of data I have to evaluate my model on is already complete, would I get the same results if I just passed the whole sequence instead of one timestep at a time (with retain_graph=True in the “one step at a time” version)? I still want the prediction for each timestep so I can evaluate how long it takes before the network makes accurate predictions.

In Keras I did this by giving the LSTM the argument return_sequences=True. This is information I can probably find easily, so no need to answer this if it would take you too much time.

And even if it works just as well to send in the whole sequence, I would still have to go back to the model that processes one timestep at a time if I tried it out with online training. But I would rather send in the whole sequence if it is faster than using the “online version” for evaluating my model on complete sequences.

“The gradients will be more accurate (correct gradient wrt full sequence)”

“That said, I still am not sure why you want to backward for each single each time step. It may save you some memory, but it takes the same, if not longer, time to do so. And it gives incorrect gradients.”

I don’t quite get this part. Do you mean when using detach() instead of retain_graph=True in the second quote? Because then I understand!

Thanks again for all the help!

Yes, it would be the same if you don’t detach and use retain_graph=True.
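A small check of this equivalence, under the assumption that no optimizer step happens between the per-step backward calls (the sizes and the squared-error loss are hypothetical). It compares a full-sequence pass against a stepwise pass that keeps the hidden state attached and uses retain_graph=True.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lstm = nn.LSTM(input_size=1, hidden_size=4)
seq = torch.randn(5, 1, 1)       # (seq_len, batch, features)
target = torch.randn(5, 1, 4)

# Full sequence: one forward, one backward on the summed loss.
out_full, _ = lstm(seq)
lstm.zero_grad()
((out_full - target) ** 2).sum().backward()
grad_full = [p.grad.clone() for p in lstm.parameters()]

# Stepwise: carry the (undetached) hidden state and backward at every step.
lstm.zero_grad()
hidden = None                    # None means a zero initial state
outs = []
for t in range(seq.size(0)):
    out, hidden = lstm(seq[t:t + 1], hidden)
    step_loss = ((out - target[t:t + 1]) ** 2).sum()
    step_loss.backward(retain_graph=True)   # gradients accumulate in .grad
    outs.append(out)
out_step = torch.cat(outs)
grad_step = [p.grad.clone() for p in lstm.parameters()]

same_outputs = torch.allclose(out_full, out_step, atol=1e-6)
same_grads = all(torch.allclose(a, b, atol=1e-5)
                 for a, b in zip(grad_full, grad_step))
```

If the parameters are updated between steps, as in true online training, the two variants no longer compute the same thing, so this only illustrates the gradient computation itself.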

In terms of the incorrect gradients, I was indeed talking about the detach option. Sorry for the confusion.

Are your training data very long sequences? If there are multiple sequences, I’d still try backward on the full sequence and see which is faster, because cuDNN is very fast. You can still evaluate every few sequences.

I started this thread not understanding at all how RNNs work in PyTorch; now everything is clear to me. Thanks a lot!

The sequences are pretty short but I still want to do this the right way and send in the whole sequence instead as it makes for cleaner testing and evaluating.

I’m going to mark the thread as solved if I figure out how to.

Hey Adrian,

Would it be possible to see the version of your net after you’ve applied this detach trick?
After reading this topic I’m still not sure how to use it properly…

It has been a while since I worked on this, so my memory isn’t too fresh.

From what I understood detach works like this:

# In this example I have 2 different nets. The input is fed to the first net.
# The output of the first net is fed to the second net.
# I only want to train the second net in this example.

net1_input = get_data()
net1_output = net1(net1_input)
# This creates a computation graph from input to output (i.e. all operations performed on the input to obtain the output). It is used when we want gradients for the weights we want to improve.
# As we only want to train the second net, we don't need the computation graph from the first net. Therefore we detach the output from the previous computation graph.
net2_input = net1_output.detach()
net2_output = net2(net2_input)
target = get_target()
loss = loss_function(net2_output, target)
# loss.backward() only computes gradients for the second net, as we detached the output of the first net before we fed it to the second net.
loss.backward()
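The same idea as a runnable sketch, assuming two small nn.Linear layers as stand-ins for net1 and net2 and random tensors in place of get_data() and get_target() (all hypothetical). After backward, only the second net has gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

net1 = nn.Linear(3, 4)   # stand-in for the first net
net2 = nn.Linear(4, 1)   # stand-in for the second net

x = torch.randn(2, 3)        # stand-in for get_data()
target = torch.randn(2, 1)   # stand-in for get_target()

# Detach net1's output so that backward stops before reaching net1.
net2_input = net1(x).detach()
loss = nn.functional.mse_loss(net2(net2_input), target)
loss.backward()

net1_has_grad = any(p.grad is not None for p in net1.parameters())
net2_has_grad = all(p.grad is not None for p in net2.parameters())
```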

This is what detach does. When working with LSTMs it can get a bit confusing to have to handle the hidden and cell states yourself. Let’s use the example from my first post in this thread:

I feed the network one time sample at a time. After each time sample I update the weights of the network. If I don’t detach the hidden state after each timestep, the gradients of the weights will take all previous timesteps into account. If I detach the hidden state after each timestep, the gradients of the weights will only consider the last timestep.

Read up on RNNs to understand what this means.

I didn’t really understand whether it was detach itself you were unsure about or how to use it with RNNs. Hopefully this helps, but if you still don’t understand I will try to give you more and clearer examples. I’m not always super quick to respond; it might take a few days. Good luck!


Thank you Adrian!
I think I’ve got the idea now.


I have a related problem if you don’t mind…

I want to process a sequence (of tokens) with an LSTM by hand, meaning that I go through the sequence with a for-loop instead of giving the whole sequence to the LSTM and letting it process everything at once.
The reason for doing this is not important (but I can explain it if you are curious).
Also, I want to use character features for each token.

So, in my network I have a first LSTM (charLSTM let’s say) computing character-level representations for tokens.
Such representations are the hidden layer of the charLSTM once it has processed the whole token.
I save character-level representations in a Variable which looks like:

char_rep = autograd.Variable( torch.zeros(sequence_length, batch_size, character_features) )

I fill this variable in a for-loop which looks like:

for i in range(sequence_length):
     char_features = init_char_features()
     lstm_out, char_features = charLSTM(char_input, char_features)     # char_input goes through the whole token
     char_rep[i,:,:] = char_features

So now I have character features, and I can also compute token features as embeddings, and I can process the sequence using both kinds of features.
These two features are given as input to a second LSTM. The hidden state of this second LSTM is then used to compute the final output of the network.
So, I save the hidden state of the LSTM in a similar way as I do for the character-level representations:

hidden_state = autograd.Variable( torch.zeros(sequence_length, batch_size, hidden_features) )

And I fill this variable in a similar way:

for i in range(sequence_length):
     lstm_input = torch.cat( [ word_embeddings[i,:,:], char_rep[i,:,:] ] )
     lstm_out, hidden_features = tokenLSTM(lstm_input, hidden_features)     # process one sequence position, but for the whole batch
     hidden_state[i,:,:] = hidden_features

After the forward step of the network, I call backward on the whole sequence, actually on the whole batch of sequences. I process batch_size sequences at the same time.

I have actually 2 questions:

  1. Should I also detach some hidden state at some point?
  2. Does filling variables as I do, e.g. char_rep or hidden_state, break the computation graph?
    I mean, I know there are issues with in-place Variable operations, which is why I’m asking this question.

My guess for the 1st question is no. However, I’m experiencing huge memory utilisation, much more than I expected, and also more than I (roughly) computed.

My guess for the 2nd question is no. However, the results are not convincing to me. I’m trying to replicate networks I already coded in the past with other frameworks. I would like to move to PyTorch because I think it would be much faster. But at the moment it is not faster, and the results are much worse than what I got with the other framework. However, the latter may be due to my own mistakes in coding the network.

Any answer would be appreciated.
Thank you in advance.
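On the second question, a small self-contained check can help, assuming a recent PyTorch version (behaviour of the older Variable API may differ). This sketch fills a preallocated buffer by indexed assignment and compares the resulting gradient with the one obtained by collecting the steps in a list and calling torch.stack; the tensor `w` and the per-step computation are hypothetical.

```python
import torch

w = torch.ones(3, requires_grad=True)

# Variant 1: fill a preallocated buffer by indexed assignment.
buf = torch.zeros(4, 3)
for i in range(4):
    buf[i] = w * (i + 1)      # slice assignment is recorded by autograd
buf.sum().backward()
grad_buffer = w.grad.clone()

# Variant 2: collect the steps in a list and stack them afterwards.
w.grad = None
stacked = torch.stack([w * (i + 1) for i in range(4)])
stacked.sum().backward()
grad_stack = w.grad.clone()
```

In this toy case both variants give the same gradient, so the indexed assignment does not cut the graph here; whether the same holds for a specific model is worth verifying with a check of this kind.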

A couple questions before I can answer your concerns.

  1. Are you reusing the same charLSTM in both places?
  2. Is charLSTM an RNN class or an RNNCell class?
  3. The torch.cat( [ word_embeddings[i,:,:], char_rep[i,:,:] ] ) call will concat along dim 0. Depending on the class of your charLSTM, this has different effects. Are you sure that this is what you want?

Ah, sorry:

  1. No, I did copy&paste. They are 2 different LSTMs
  2. LSTMs are both LSTMs (nn.LSTM)
  3. I left out some details: torch.cat is performed “correctly”, so that the concatenated number of features matches the expected input size of the second LSTM. So basically,
torch.cat( [ word_embeddings[i,:,:], char_rep[i,:,:] ] )

is performed so that the result is 1 x batch_size x (<word_embedding-dim.> + <char_features-dim.>)

Does that make sense?
Thank you

Sorry, may I ask what version you are using? I’m fairly certain that torch.cat concatenates along dim 0 by default, as I just checked the docs for master, 0.3 and 0.2, and tested on master.
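A quick way to see the default for oneself, using two hypothetical tensors of equal shape:

```python
import torch

a = torch.zeros(4, 3)
b = torch.ones(4, 3)

default_cat = torch.cat([a, b])          # no dim given: concatenates along dim 0
feature_cat = torch.cat([a, b], dim=1)   # explicit dim: concatenates the features

default_shape = tuple(default_cat.shape)
feature_shape = tuple(feature_cat.shape)
```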

Sorry, I should have explained more clearly.
I actually do not concatenate along dimension 0, I concatenate along dimension 2.
I actually perform:

torch.cat( [word_embeds[i,:,:].view(1, batch_size, -1), char_rep[i,:,:].view(1, batch_size, -1)], 2 )

because all concatenated Variables are actually 3-dimensional, and I want to concatenate along the features dimension (the other two are sequence_length and batch_size).
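The shapes involved can be sketched as follows, with hypothetical sizes for the sequence length, batch and feature dimensions. The slicing variant at the end is an alternative that keeps the leading dimension without the view call.

```python
import torch

seq_len, batch_size, word_dim, char_dim = 5, 4, 3, 2   # hypothetical sizes
word_embeds = torch.randn(seq_len, batch_size, word_dim)
char_rep = torch.randn(seq_len, batch_size, char_dim)

i = 0
# Indexing with i drops the sequence dimension; view restores it as size 1.
step_input = torch.cat(
    [word_embeds[i, :, :].view(1, batch_size, -1),
     char_rep[i, :, :].view(1, batch_size, -1)],
    2,
)

# Slicing with i:i+1 keeps the sequence dimension, so no view is needed.
step_input_alt = torch.cat([word_embeds[i:i + 1], char_rep[i:i + 1]], dim=2)
```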