CUDA out of memory when using retain_graph=True


I’m working on an RNN at the moment, but the retain_graph option eventually consumes all of my GPU memory, and training seems to get slower every epoch.

def learn(X, y, hidden):
  output, hidden = model(X, hidden)
  loss = criterion(output, y)
  loss.backward(retain_graph=True)  # fails without retain_graph=True when hidden is reused
  optimizer.step()
  return loss, output, hidden

However, when I don’t specify retain_graph=True I get the following error: “RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time”. I notice this only happens when I keep the hidden value; if hidden is always None, it works just fine.

Is there a way around this?



When computing the gradients with the backward call, PyTorch automatically frees the computation graph used to create all the variables, and only stores the gradients on the parameters to perform the update (intermediate values are deleted).
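A minimal standalone illustration of that behaviour (a toy graph, not from this thread):

```python
import torch

# Toy graph: a second backward normally fails because the intermediate
# buffers are freed after the first backward call.
x = torch.ones(3, requires_grad=True)
y = (x * x).sum()
y.backward()
try:
    y.backward()  # RuntimeError: trying to backward through the graph a second time
except RuntimeError as e:
    print("second backward failed:", e)

# retain_graph=True keeps the buffers, so repeated backwards work,
# at the cost of holding the whole graph in (GPU) memory.
z = (x * x).sum()
z.backward(retain_graph=True)
z.backward()
print(x.grad)  # gradients from all three backward calls accumulate
```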

In your case, what I guess is happening is that after computing the derivatives of your criterion you use the hidden state to compute the cost at time t+1, so when you call backward again on this cost PyTorch does not know how to backtrack through the already-freed graph. In an RNN it is natural to compute the cost this way, as you have to keep track of the recurrence.
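A toy sketch of that situation (a hypothetical two-step recurrence, not the poster's model): the loss at t+1 still reaches back through the carried hidden state into the graph that the first backward already freed.

```python
import torch

w = torch.ones(1, requires_grad=True)

# Step t: hidden state computed through w (tanh saves its output for backward).
hidden = torch.tanh(w)
loss_t = hidden.sum()
loss_t.backward()  # frees the graph that produced `hidden`

# Step t+1: this loss depends on `hidden`, hence on the freed graph.
loss_t1 = (hidden * hidden).sum()
try:
    loss_t1.backward()
except RuntimeError as e:
    print("needs retain_graph=True:", e)

# One common alternative: detach the carried state so each step's graph
# is self-contained (truncated backpropagation through time).
hidden = torch.tanh(w).detach()
```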

What you might do is free the memory when the epoch finishes. If you post the code of your main loop, maybe I can suggest a modification.


Thanks for the reply, @jmaronas. How can I free the memory manually?
My main loop is as follows:

hidden = None

for epoch in range(epochs):
  predictions = []
  true_values = []
  loss_avg = 0
  progress_bar = tqdm_notebook(dataloader)
  for i, (X,y) in enumerate(progress_bar):
    progress_bar.set_description('Epoch ' + str(epoch))
    X =
    y =
    loss, output, hidden = learn(X,y, hidden)
    loss_avg += loss

Where is your recurrence step defined?

Your code explodes because of loss_avg += loss. If you do not free the buffers (retain_graph=True, which you have to set because you need the graph to compute the recurrence gradient), then everything is stored in loss_avg. Take into account that loss, in your case, is not only the cross-entropy or whatever: it is everything used to compute it. If you want to keep track of the scalar value that represents your accumulated loss, you can do loss_avg += loss.data (though the use of .data is deprecated, for cases like this I still find it useful, clean and simple). This will only store the actual scalar value.
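As an illustration, a hypothetical training loop using loss.item(), the modern equivalent of .data for this purpose:

```python
import torch

model = torch.nn.Linear(4, 1)
criterion = torch.nn.MSELoss()

loss_avg = 0.0
for _ in range(3):
    X, y = torch.randn(8, 4), torch.randn(8, 1)
    loss = criterion(model(X), y)
    loss.backward()

    # loss_avg += loss        # keeps every iteration's graph alive
    loss_avg += loss.item()   # stores only the Python float

print(loss_avg / 3)  # plain float average, no graphs attached
```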

Anyway I think your code should look something like:

for e in range(epochs):
    for idx, (x, y) in enumerate(data_loader):
        # x should be 3-dimensional (recurrence, samples, dimension) if your network
        # is fully connected, else 4-dimensional (time_step, batch, rows, cols)
        for t in range(time_steps):
            ...


My model is an nn.LSTM connected to an nn.Linear. It is defined as follows:

import torch.nn as nn

class RNN(nn.Module):
    # input of shape (seq_len, batch, input_size)
    def __init__(self, input_size, hidden_size, output_size=1, num_layers=1, batch_size=1):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.input_size = input_size
        self.batch_size = batch_size
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers)
        self.mlp = nn.Linear(hidden_size, output_size)

    def forward(self, X, H=None):
        out, hidden = self.lstm(X,H)
        out = self.mlp(out[-1])
        return out, hidden

I thought the nn.LSTM module already took care of the recurrence step, since the documentation for nn.LSTM says: output of shape (seq_len, batch, num_directions * hidden_size): tensor containing the output features (h_t) from the last layer of the LSTM, for each t.

Am I missing something?
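For what it's worth, those shapes can be checked directly (a standalone example with made-up sizes):

```python
import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size = 5, 2, 3, 7
lstm = nn.LSTM(input_size, hidden_size)  # num_layers=1 by default

X = torch.randn(seq_len, batch, input_size)
out, (h, c) = lstm(X)  # hidden state defaults to zeros when not given

print(out.shape)  # torch.Size([5, 2, 7]): h_t for every time step t
print(h.shape)    # torch.Size([1, 2, 7]): final hidden state per layer
# out[-1] is the output at the last time step, which the forward() above
# passes through the nn.Linear layer.
```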

Okay, if you use nn.LSTM() you have to call .backward() with retain_graph=True so PyTorch can backpropagate through time, and then call optimizer.step(). Your problem is then in accumulating the loss for printing (monitoring or whatever). Just do loss_avg += loss.data, because if not you will be storing all the computation graphs from all the epochs. As the graph has not been freed during the backward call, you have to do it this way to keep only the scalar value representing the cost.


Thanks for the help 🙂. So I don’t need to free the memory manually? Anyhow, I’d like to know how to do it properly, if you could give me some reference, haha. Thanks!

You can just do what I told you.

Option 1:

loss_avg += loss.data

Option 2:

with torch.no_grad():
    loss_avg += loss


Take a look at the autograd documentation to see what the torch Tensor class stores.
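For instance, the pieces of autograd state the docs describe can be inspected directly (a standalone sketch):

```python
import torch

x = torch.ones(2, requires_grad=True)
y = (x * x).sum()

print(y.requires_grad)  # True: y is recorded in a graph
print(y.grad_fn)        # the Function that produced y (a SumBackward node)

with torch.no_grad():   # operations here are not recorded
    z = (x * x).sum()
print(z.requires_grad)  # False: z carries no graph
```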