CUDA out of memory when using retain_graph=True


I’m working on an RNN at the moment, but the retain_graph option eventually consumes all of my GPU memory, and training seems to get slower every epoch.

def learn(X, y, hidden):
  output, hidden = model(X, hidden)
  loss = criterion(output, y)
  loss.backward(retain_graph=True)  # fails without retain_graph=True when hidden is reused
  optimizer.step()
  return loss, output, hidden

However, when I don’t specify retain_graph=True I get the following error: “RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time”. I notice this only happens when I keep the hidden value; if hidden is always None, it works just fine.

Is there a way around this?



When computing the gradients with the backward call, PyTorch automatically frees the computation graph used to create all the variables, and only stores the gradients on the parameters to perform the update (intermediate values are deleted).
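A minimal standalone illustration of that behaviour (a toy graph, not from this thread):

```python
import torch

# Toy graph: a second backward normally fails because the intermediate
# buffers are freed after the first backward call.
x = torch.ones(3, requires_grad=True)
y = (x * x).sum()
y.backward()
try:
    y.backward()  # RuntimeError: trying to backward through the graph a second time
except RuntimeError as e:
    print("second backward failed:", e)

# retain_graph=True keeps the buffers, so repeated backwards work,
# at the cost of holding the whole graph in (GPU) memory.
z = (x * x).sum()
z.backward(retain_graph=True)
z.backward()
print(x.grad)  # gradients from all three backward calls accumulate
```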

In your case, what I guess is happening is that after computing the derivatives of your criterion you use the hidden state to compute the cost at time t+1, so when you call backward again on this cost PyTorch does not know how to backtrack through the already-freed graph. In an RNN it is natural to compute the cost this way, as you have to keep track of the recurrence.
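A toy sketch of that situation (a hypothetical two-step recurrence, not the poster's model): the loss at t+1 still reaches back through the carried hidden state into the graph that the first backward already freed.

```python
import torch

w = torch.ones(1, requires_grad=True)

# Step t: hidden state computed through w (tanh saves its output for backward).
hidden = torch.tanh(w)
loss_t = hidden.sum()
loss_t.backward()  # frees the graph that produced `hidden`

# Step t+1: this loss depends on `hidden`, hence on the freed graph.
loss_t1 = (hidden * hidden).sum()
try:
    loss_t1.backward()
except RuntimeError as e:
    print("needs retain_graph=True:", e)

# One common alternative: detach the carried state so each step's graph
# is self-contained (truncated backpropagation through time).
hidden = torch.tanh(w).detach()
```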

What you might do is free the memory when the epoch finishes. If you post the code of your main loop, maybe I can suggest a modification.


Thanks for the reply, @jmaronas. How can I free the memory manually?
My main loop is as follows:

hidden = None

for epoch in range(epochs):
  predictions = []
  true_values = []
  loss_avg = 0
  progress_bar = tqdm_notebook(dataloader)
  for i, (X,y) in enumerate(progress_bar):
    progress_bar.set_description('Epoch ' + str(epoch))
    X =
    y =
    loss, output, hidden = learn(X,y, hidden)
    loss_avg += loss

Where is your recurrence step defined?

Your code explodes because of loss_avg += loss. If you do not free the buffers (retain_graph=True, which you have to set because you need the graph to compute the recurrence gradient), then everything is stored in loss_avg. Take into account that loss, in your case, is not only the cross-entropy or whatever: it is everything used to compute it. If you want to keep track of the scalar value that represents your accumulated loss, you can do loss_avg += loss.data (though the use of .data is deprecated, for cases like this I still find it useful, clean and simple). This will only store the actual scalar value.
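As an illustration, a hypothetical training loop using loss.item(), the modern equivalent of .data for this purpose:

```python
import torch

model = torch.nn.Linear(4, 1)
criterion = torch.nn.MSELoss()

loss_avg = 0.0
for _ in range(3):
    X, y = torch.randn(8, 4), torch.randn(8, 1)
    loss = criterion(model(X), y)
    loss.backward()

    # loss_avg += loss        # keeps every iteration's graph alive
    loss_avg += loss.item()   # stores only the Python float

print(loss_avg / 3)  # plain float average, no graphs attached
```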

Anyway I think your code should look something like:

for e in range(epochs):
    for idx, (x, y) in enumerate(data_loader):
        # x should be 3-dimensional (recurrence, samples, dimension) if your network
        # is fully connected, else 4-dimensional (time_step, batch, rows, cols)
        for t in range(time_steps):
            ...


My model is an nn.LSTM connected to an nn.Linear. It is defined as follows:

import torch.nn as nn

class RNN(nn.Module):
    # input of shape (seq_len, batch, input_size)
    def __init__(self, input_size, hidden_size, output_size=1, num_layers=1, batch_size=1):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.input_size = input_size
        self.batch_size = batch_size
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers)
        self.mlp = nn.Linear(hidden_size, output_size)

    def forward(self, X, H=None):
        out, hidden = self.lstm(X,H)
        out = self.mlp(out[-1])
        return out, hidden

I thought the nn.LSTM module already took care of the recurrence step, since the documentation for nn.LSTM says: output of shape (seq_len, batch, num_directions * hidden_size): tensor containing the output features (h_t) from the last layer of the LSTM, for each t.

Am I missing something?
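For what it's worth, those shapes can be checked directly (a standalone example with made-up sizes):

```python
import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size = 5, 2, 3, 7
lstm = nn.LSTM(input_size, hidden_size)  # num_layers=1 by default

X = torch.randn(seq_len, batch, input_size)
out, (h, c) = lstm(X)  # hidden state defaults to zeros when not given

print(out.shape)  # torch.Size([5, 2, 7]): h_t for every time step t
print(h.shape)    # torch.Size([1, 2, 7]): final hidden state per layer
# out[-1] is the output at the last time step, which the forward() above
# passes through the nn.Linear layer.
```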

Okay, if you use nn.LSTM() you have to call .backward() with retain_graph=True so PyTorch can backpropagate through time, and then call optimizer.step(). Your problem is then in accumulating the loss for printing (monitoring or whatever). Just do loss_avg += loss.data, because if not you will be storing all the computation graphs from all the epochs. As the graph has not been freed during the backward call, you have to do it this way to keep only the scalar value representing the cost.


Thanks for the help 🙂. So I don’t need to free the memory manually? Anyhow, I’d like to know how to do it properly, if you could give me some reference, haha. Thanks!

You can just do what I told you.

Option 1:

loss_avg += loss.data

Option 2:

with torch.no_grad():
    loss_avg += loss


Take a look at the autograd documentation to see what the torch Tensor class stores.
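For instance, the pieces of autograd state the docs describe can be inspected directly (a standalone sketch):

```python
import torch

x = torch.ones(2, requires_grad=True)
y = (x * x).sum()

print(y.requires_grad)  # True: y is recorded in a graph
print(y.grad_fn)        # the Function that produced y (a SumBackward node)

with torch.no_grad():   # operations here are not recorded
    z = (x * x).sum()
print(z.requires_grad)  # False: z carries no graph
```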