RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time

I keep running into this error:

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

Can someone explain what this means? Independent of the context of the problem, I don’t understand what the buffers are and what it means for them to be “freed”.

Context: In my case, it happens the second time I call loss.backward(), inside a training function, where the model's forward pass executes one step of a recurrent network and updates its hidden state:

def train_it(self, x, y):
    prediction = self(x)
    self.zero_grad()
    loss = self.loss(prediction, y)
    loss.backward()
    self.optimizer.step()

To reduce memory usage, all the intermediary results are deleted during the .backward() call as soon as they are no longer needed. Hence if you try to call .backward() again, the intermediary results no longer exist and the backward pass cannot be performed (which gives you the error you see).
You can call .backward(retain_graph=True) to make a backward pass that will not delete the intermediary results, so you will be able to call .backward() again. All but the last call to backward should use the retain_graph=True option.
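
For illustration, here is a minimal sketch (the tensor is made up for the example) of what two backward passes through the same graph look like:

import torch
from torch.autograd import Variable as V

x = V(torch.randn(3, 1), requires_grad=True)
loss = (x * 2).sum()

loss.backward(retain_graph=True)  # keeps the intermediary results alive
loss.backward()                   # last call, so the graph can be freed now
# A third loss.backward() here would raise the RuntimeError from the title.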

Ah, now I understand, thanks!

This is still unclear to me and it may impact my particular use case. To demonstrate, I’m building a very simple recurrent network (one node) and unrolling it for training. At the start of each gradient descent iteration I need to reset the output/estimate vector to initial values. I want to do this in an efficient way that doesn’t require allocating new GPU memory every time. But when I try to do that, I get the .backward() error as in the title of this thread. Hopefully this code will make the issue clear. @albanD, do you have a suggestion?

import torch
from torch.autograd import Variable as V

# GOAL: Do gradient descent on unrolled network that is simply:
#   y = y + w*x  -->  y(t) = y(t-1) + w*x(t)

# Random training data
X = V(torch.randn(100,1).cuda())
Y = V(torch.randn(100,1).cuda())

nGD = 10        # number of gradient descent iterations
nTime = 5       # number of unrolling time steps

# Initialize things
gamma = 0.1
w = V(torch.randn(1,1).cuda(), requires_grad=True)
Yest = V((0.5*torch.ones(100,1)).cuda()) # Don't really care about values in Yest at this point, just allocating GPU memory

for iGD in range(nGD):
    # At start of processing for each GD iteration, the
    # output estimate, Yest, should be set to an initial
    # estimate of some fixed value.  E.g., ...
    Yest.data.zero_().add_(0.5)                # This line fails on second GD iteration.
    # Yest = V((0.5*torch.ones(100,1)).cuda()) # This works, but I think it's allocating GPU memory every time, and thus slow.

    for iTime in range(nTime):
        Yest = Yest + w*X
    cost = torch.mean((Y - Yest)**2)    
    cost.backward() # compute gradients
    w.data.sub_(gamma*w.grad.data) # Update parameters
    w.grad.data.zero_() # Reset gradients to zeros

Note that if I set retain_graph=True, then the code runs for a long time. I get the feeling it keeps adding to the graph with each GD iteration in that case, which is not what I want.

Hi,

The problem in your case is that inside your training loop you do Yest = Yest + w*X, which builds on top of the “preallocated” buffer that you created. Resetting its value does not remove its history, and so your computational graph grows as it remembers everything.
You can solve that by changing the loop to use another name, e.g. Yest_local = Yest_local + w*X, and setting Yest_local = Yest before the loop.
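
In code, that first fix is just a rename so the loop never rebinds the persistent name (a sketch of the changed lines only):

Yest_local = Yest                       # the preallocated Yest keeps no history
for iTime in range(nTime):
    Yest_local = Yest_local + w*X
cost = torch.mean((Y - Yest_local)**2)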

That being said, a better way to solve your issue is to keep buffers as plain tensors and wrap them into Variables only when you need them. That way, you are sure that you will not accidentally have your graph expanding to previous iterations.
In this case, you would define it as follows:
Before the training loop:
Yest_buffer = torch.zeros(100, 1).cuda()
For each iteration of the training loop:

Yest_buffer.zero_().add_(0.5)
Yest = V(Yest_buffer)

Keep in mind that wrapping a Tensor into a Variable is completely free, and you should always wrap your tensors as late as possible, ideally just before the moment where you actually need autograd.
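
Putting it together, your training loop would look roughly like this (a sketch under the same setup as your snippet above):

Yest_buffer = torch.zeros(100, 1).cuda()  # allocated once, before the loop

for iGD in range(nGD):
    Yest_buffer.zero_().add_(0.5)  # reset the values in place, no new allocation
    Yest = V(Yest_buffer)          # fresh Variable with an empty history

    for iTime in range(nTime):
        Yest = Yest + w*X
    cost = torch.mean((Y - Yest)**2)
    cost.backward()
    w.data.sub_(gamma*w.grad.data)
    w.grad.data.zero_()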

Perfect. Thanks so much!!!

In each minibatch gradient descent step, the computation graph is created once, and then we backprop the loss to update the model weights. Is that right?

Also, I would like to know what kind of intermediate results are freed so that we cannot call backward() on the final loss anymore, unless we use retain_graph=True. AFAIK, the computation graph is composed of variables and math operations. Since the variables remain, does it imply that the freed info is the series of math operations that lead to the final loss?

What is freed is everything that is not accessible from Python in any way other than through the backward pass.
Some operations store extra temporary buffers, and intermediary Variables that are not accessible from Python will also be freed.
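
A minimal sketch of the distinction (made-up names, same Variable-era API as above):

x = V(torch.ones(2), requires_grad=True)
y = x * 2            # we hold a Python reference to y, so y itself survives
z = (y * y).sum()    # the inputs saved internally for this multiply do not
z.backward()         # frees those internal saved buffers
# A second z.backward() here raises the RuntimeError from the title.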

I have a similar problem. My model returns two values, pos_score and neg_scores; the first is the positive value, the other is the sampled negative values.
My loss function looks like this:
def loss_function(pos_score, neg_scores):
    return -torch.mean(torch.log(
        F.sigmoid(pos_score - neg_scores)), 0)  # BPR loss

I still cannot find where a preallocated buffer gets updated when I run the model a second time.
Could you please help me with that? Thanks!

How do you “run the model a second time”? This loss function is fine. Could you share more code?

Thank you for your reply. Below is my code.

neg_targets = sample_neg()
model.zero_grad()

pos_score, neg_scores = model(autograd.Variable(torch.LongTensor(inputs_[reviewer][i])),
                              autograd.Variable(torch.LongTensor([targets[reviewer][i][-1]])),
                              autograd.Variable(torch.LongTensor(neg_targets)),
                              rnn_first, rnn_last,
                              autograd.Variable(torch.FloatTensor(queries[reviewer][i])),
                              autograd.Variable(torch.FloatTensor([c_query[reviewer][i][-1]])),
                              update)

loss = loss_function(pos_score, neg_scores)
loss.backward()

nn.utils.clip_grad_norm(model.parameters(), 0.5)
optimizer.step()

neg_targets are sampled negative examples. rnn_first, rnn_last, and update are booleans that indicate different conditions. This is inside my training loop.
It can run with retain_graph=True, but I'm not sure which part needs more than one backward pass.

These are all fine. Are you saving any intermediate results in your model?

Yes! I think that is where the problem may be! I create a list before training the model, and when the indicator ‘update’ is True, I append some parameters of the model to this list.

But I really need to save these parameters. Is there another solution that avoids setting retain_graph=True, so that the model doesn't keep requiring many times the memory of the first round?

If I understand it correctly, you are saving some intermediate values for inspection later, right? If that is the case, make sure that you save the raw data, as opposed to saving the Variable.

For example, if you want to record the total_loss of the epoch, then after each batch you should use total_loss += loss.data[0] (loss is the loss calculated for a batch) instead of total_loss += loss.
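
A sketch of the difference (the loop and the names criterion, model, optimizer, batches are hypothetical):

total_loss = 0.0
for x, y in batches:
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    total_loss += loss.data[0]  # a plain float: no graph is kept alive
    # total_loss += loss        # a Variable: keeps every batch's graph alive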

After trying several times, I found that's not the key.
Now I wonder what the graph buffers are and when PyTorch frees them. Do the graph buffers mean all the Variables (both requires_grad=True and requires_grad=False)?

In the typical case, a new graph is created for each batch; after the backward pass, the graph is freed. Then the next batch comes and a new graph is created. The process repeats until you finish your training.

For more info, you can read my post on the computation graph in PyTorch; I hope it can help you.

If possible, pm me your code and I will take a look.

I think my code is totally in a mess… In every training loop I want to save self.num_steps hidden states for further use. rnn_first means the GRU's inputs are self.num_steps long. Leaving rnn_last aside (the bug arises before it), every call except the rnn_first one only feeds in the last of the inputs (my data is split into self.num_steps blocks), so I reset the GRU's self.state to the last entry of self.rnn_states.
I'm not sure whether you can follow me or not…

def rnn(self, rnn_first, rnn_last, inputs_r, target):

    rnn_inputs = self.embed(inputs_r)
    rnn_inputs = F.alpha_dropout(rnn_inputs, p=self.keep_prob)

    self.initial_state = torch.zeros(1, self.batch_size, self.global_dim)  # initialize hidden state

    if rnn_first:
        self.state = autograd.Variable(self.initial_state)
        for index, i in enumerate(rnn_inputs):
            rnn_out, self.state = self.gru(i.view(1, self.batch_size, self.global_dim),
                                           self.state)
            # Store every hidden state
            if index == 0:
                state_s = self.state.view(1, self.global_dim)
            else:
                state_s = torch.cat((state_s, self.state.view(1, self.global_dim)), 0)

        self.rnn_states = state_s

    elif rnn_last:  # If this is the last one, add the last asin to the rnn (no output needed)
        self.state = self.rnn_states[-1].view(1, self.batch_size, self.global_dim)
        rnn_input = self.embed(target)
        output, self.state = self.gru(rnn_input.view(1, self.batch_size, self.global_dim), self.state)

        self.rnn_states = torch.cat(torch.split(self.rnn_states, 1)[1:])
        self.rnn_states = torch.cat((self.rnn_states, self.state.view(1, self.global_dim)))

    else:  # Not the first time step for each user; process one asin at a time
        self.state = self.rnn_states[-1].view(1, self.batch_size, self.global_dim)
        rnn_input = rnn_inputs[-1]
        output, self.state = self.gru(rnn_input.view(1, self.batch_size, self.global_dim), self.state)

        self.rnn_states = torch.cat(torch.split(self.rnn_states, 1)[1:])
        self.rnn_states = torch.cat((self.rnn_states, self.state.view(1, self.global_dim)))

This is the problem. Every Variable you save has its computation graph attached. If you use it in a future forward pass, then the previous computation graph needs to still be there (retain_graph=True), or PyTorch will complain since it can't properly backprop.

Why do you need those? Could you just backprop through a limited number of steps?
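
If gradients only need to flow within the current iteration, one common pattern (a sketch, not taken from your code) is to break the history before storing the state, which in this Variable-era API is done with .detach():

saved = self.state.detach()  # same values, but cut off from the graph
self.rnn_states = torch.cat((self.rnn_states, saved.view(1, self.global_dim)))
# A later forward pass that reads self.rnn_states no longer drags in the
# previous iteration's graph, so retain_graph=True is no longer needed.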

Thanks for your quick reply!
Actually, I'm trying to implement an attention mechanism, so I need these hidden states to multiply by the attention weights.