RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time

Thanks a lot!
Perfect answer!
Wish you a good day!


I’m getting this same issue and can’t quite figure out where the reuse of an already cleared variable could be. Am I missing something obvious?

code:

def train(epoch, model, train_loader, optimizer, criterion, summary, use_gpu=False, log_interval=10):
    correct = 0
    total = 0
    for idx, (x, y) in enumerate(train_loader):
        y = y.squeeze(1)
        x, y = Variable(x), Variable(y)
        x = x.cuda() if use_gpu else x
        y = y.cuda() if use_gpu else y

        preds = model(x)

        loss = criterion(preds, y)
        loss.backward(retain_graph=True)
        optimizer.step()
        optimizer.zero_grad()

        # TODO: turn this part into callbacks
        if idx % log_interval == 0:
            # Log loss
            index = (epoch * len(train_loader)) + idx + 1
            avg_loss = loss.data.mean()
            summary.add_scalar('train/loss', avg_loss, index)

            # Log accuracy
            total += len(x)
            pred_classes = torch.max(preds.data, 1)[1]
            correct += (pred_classes == y.data).sum()
            acc = correct / total
            summary.add_scalar('train/acc', acc, index)

Without retain_graph=True I get the same exception as above


Very elegant and interesting example. So if I want to call backward more than once without retain_graph, I need to redo the computation from all the leaves?


Yes, because a backward without retain_graph is basically a “backward in which you delete the graph as you go along”.
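For instance, a minimal sketch on a toy graph, just to illustrate the point:

```python
import torch

x = torch.ones(2, requires_grad=True)
y = (x * x).sum()

y.backward()       # the graph is freed while this backward pass runs
# y.backward()     # would raise: "Trying to backward through the graph a second time..."

y = (x * x).sum()  # redo the forward from the leaves...
y.backward()       # ...and backward works again; x.grad now holds the sum of both passes
```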


What are the “intermediary results” exactly?
This process is giving me vertigo. What exactly happens that creates and then deletes the ‘results’?


Intermediary results are values from the forward pass that are needed to compute the backward pass.
For example, suppose your forward pass looks like this and you want gradients for the weights:

middle_result = first_part_of_net(inp)
out = middle_result * weights

When computing the gradients, you need the value of middle_result, so it needs to be stored during the forward pass. This is what I call intermediary results.

These intermediary results are created whenever you perform operations that require some of the forward tensors to compute their backward pass.
To reduce memory usage, during the backward pass, these are deleted as soon as they are not needed anymore (of course if you use this Tensor somewhere else in your code you will still have access to it, but it won’t be stored by the autograd engine anymore).
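To make that concrete, here is a small self-contained sketch, with first_part_of_net stood in by a plain multiplication:

```python
import torch

inp = torch.randn(4)
weights = torch.randn(4, requires_grad=True)

middle_result = inp * 3                  # stand-in for first_part_of_net(inp)
out = (middle_result * weights).sum()

# d(out)/d(weights) is middle_result, so autograd had to save it during the forward
out.backward()     # the saved middle_result is freed once this backward has used it
# out.backward()   # a second call would hit the "buffers have already been freed" error
```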


Thank you.
So is this .backward() method called behind the scenes somewhere inside PyTorch when the optim.Adam step is run?
I’ve seen a few examples of neural networks with PyTorch but I don’t get where the weights are.

You can check the tutorials on how to train a neural network and what each function is doing.
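For reference, in the training loops those tutorials walk through, backward is called explicitly and the weights are the parameters registered on the nn.Module; the optimizer only reads the gradients that backward leaves behind. A minimal sketch (toy model, illustrative names):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)                         # the weights: model.weight and model.bias
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

loss = model(torch.randn(8, 4)).pow(2).mean()   # some scalar loss

loss.backward()        # you call backward yourself; it fills p.grad for every parameter
optimizer.step()       # Adam only reads those .grad fields, it never calls backward
optimizer.zero_grad()  # clear the gradients before the next iteration
```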


I am not sure how retain_graph=True and zero_grad() interact. Have a look at this:

(code is adapted from the answer and might not be 100% correct, but I hope you get what I mean)

prediction = self(x)
self.zero_grad()
loss = self.loss(prediction, y)
loss.backward(retain_graph=True) #retains weights --> gradients?
loss.backward() ## add gradients to gradients? makes them a lot stronger?

vs:
loss.backward(retain_graph=True) #retains weights? --> gradients
self.zero_grad()
loss.backward() ## gradients are zero, how is retain_graph=True effective in this case?

Or is retain_graph just keeping the weights rather than the gradients? I am a bit confused.

Hi,

retain_graph has nothing to do with gradients. It just allows you to call backward a second time. If you don’t set it in the first .backward() call, you won’t be able to call backward a second time.
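To make that concrete, a minimal sketch on a toy graph: retain_graph only keeps the graph alive for a second call, while the “adding gradients to gradients” you describe comes from backward always accumulating into .grad:

```python
import torch

w = torch.ones(3, requires_grad=True)
loss = (2 * w).sum()

loss.backward(retain_graph=True)   # graph kept alive for a second call
print(w.grad)                      # tensor([2., 2., 2.])

loss.backward()                    # second call works and *adds* to the existing .grad
print(w.grad)                      # tensor([4., 4., 4.])

w.grad.zero_()                     # this is what zero_grad() does for every parameter
```

So in your second variant, calling zero_grad() between the two backward calls just means the second call writes its gradients into a freshly zeroed .grad; it does not interact with retain_graph at all.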


Are you sure you get an error if you don’t use retain_graph = True?

This seems to be the normal pattern for running a model inside a loop of iterations, as the model creates a new graph every time it computes preds.
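For reference, the usual pattern looks like this (a toy model and dataset, just to show that every iteration builds a fresh graph and calls backward on it exactly once, so retain_graph is not needed):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = DataLoader(TensorDataset(torch.randn(32, 10), torch.randint(0, 2, (32,))),
                    batch_size=8)

for x, y in loader:
    optimizer.zero_grad()
    preds = model(x)           # a brand-new graph is built by every forward
    loss = criterion(preds, y)
    loss.backward()            # one backward per graph: nothing is reused afterwards
    optimizer.step()
```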

If you could be more specific about the problem you faced, that would help a lot.

Thanks

Hi,

Do I need to wrap up tensors or variables that are not being used in autograd? For example, below is my training code. I have a few numpy arrays and lists created only to store results. Do I need to wrap them up too?

def train(epoch):

    trainSeqs = dataloader.train_seqs_KITTI
    trajLength = range(dataloader.minFrame_KITTI, dataloader.maxFrame_KITTI, 10)

    rn.shuffle(trainSeqs)
    rn.shuffle(trajLength)

    avgT_Loss = 0.0
    avgR_Loss = 0.0

    num_itt = 0

    avgRotLoss = []
    avgTrLoss = []

    loss_itt = np.empty([cmd.itterations, 2])

    for seq in trainSeqs:
        for tl in trajLength:
            # get a random subsequence from 'seq' of length 'fl' : starting index, ending index
            stFrm, enFrm = dataloader.getSubsequence(seq, tl, cmd.dataset)
            # itterate over this subsequence and get the frame data.
            flag = 0
            print(stFrm, enFrm)
            for frm1 in range(stFrm, enFrm):

                inp, axis, t = dataloader.getPairFrameInfo(frm1, frm1 + 1, seq, cmd.dataset)

                deepVO.zero_grad()
                # Forward, compute loss and backprop
                output_r, output_t = deepVO.forward(inp, flag)
                loss_r = criterion(output_r, axis)
                loss_t = criterion(output_t, t)
                # Total loss
                loss = loss_r + cmd.scf * loss_t
                if frm1 != enFrm - 1:
                    loss.backward(retain_graph=True)
                else:
                    loss.backward(retain_graph=False)
                optimizer.step()

                avgR_Loss = (avgR_Loss * num_itt + loss_r) / (num_itt + 1)
                avgT_Loss = (avgT_Loss * num_itt + loss_t) / (num_itt + 1)

                loss_itt[num_itt, 0] = loss_r
                loss_itt[num_itt, 1] = loss_t

                flag = 1
                num_itt = num_itt + 1
                print(num_itt)
                if num_itt == cmd.itterations:
                    avgRotLoss.append(np.average(loss_itt[:, 0]))
                    avgTrLoss.append(np.average(loss_itt[:, 1]))
                    print(np.average(loss_itt[:, 0]), np.average(loss_itt[:, 1]))
                    num_itt = 0

    plt.plot(avgRotLoss, 'r')
    plt.plot(avgTrLoss, 'g')

    plt.save('/u/sharmasa/Documents/DeepVO/plots/epoch_' + str(epoch))

@albanD
I still don’t understand why, even though I call the forward() function before loss.backward(), it still triggers the error:

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

I’m not sure to understand your question.
Maybe open a new thread as this one is quite long already.

Thanks @albanD. I think the bug in my previous report was unrelated to this topic; I forgot to initialize the hidden state of the LSTM. So I closed my report.

hi,

Could you format your code using 3 backticks ``` so that it’s more readable?
Also, which version of PyTorch are you using? Some big changes have been made to the backward of the graph, and this might be fixed in the latest version.
Otherwise I’m not sure why this would happen in your case, but it is most certainly some state saved from one iteration to the next.

The bug is solved, so the report is closed.

Thanks, this is clear. But if that is the case, I just wonder why the mini-batch update works: backward is called for each mini-batch, such as in this one link


I have the same problem. I have seen all the replies, but I can’t find the right way to handle it.
The problem occurs when I add the for loop in the forward function. Hoping for your help!

class ContrasiveMarginLoss(nn.Module):
    def __init__(self, num_features,num_classes,margin=0.2,model=None,dataloader=None,unselected=0):
        super(ContrasiveMarginLoss, self).__init__()
        self.margin = margin
        self.model = model
        self.loader=dataloader

        self.register_buffer('V',torch.zeros(num_classes, num_features))
        self.V = extract_features(self.model,self.loader,self.V).to(device)
        self.V = normalize(self.V)

        self.unselected_data = unselected

        if margin is not None:
            self.ranking_loss = nn.MarginRankingLoss(margin=margin,reduction='sum')
        else:
            self.ranking_loss = nn.SoftMarginLoss()

    def forward(self,features,labels,normalize_feature=True):
        if normalize_feature:
            features = normalize(features)
        #dist,dist_max,y = ComputeDist(self.V)(features,labels)
        N = features.size(0)

        if normalize_feature:
            features = normalize(features)        #[batch_size,2048]
        dist = euclidean_dist(features,self.V)    #[16,12185]
        dist_max,y = sample_mining(dist,labels)

        V_temp = Variable(self.V)
        for m, n in zip(features,labels):
            V_temp[n] = F.normalize( (V_temp[n] + m) / 2, p=2, dim=0)
        print(V_temp)
        self.V = V_temp


        loss = (1/N) * self.ranking_loss(dist,dist_max,y)

        return loss,dist,dist_max

Hi,

What is your extract_features function doing? Make sure that self.V does not require gradients during your __init__; otherwise, that part of the graph will be shared by any forward using it.
Also, Variables don’t exist anymore, so you can simply remove any use of them.
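For illustration, a minimal sketch of one way to keep that buffer update out of the autograd graph (update_buffer is a hypothetical helper, and the rest of the class is assumed unchanged; an equivalent fix is to .detach() the new value before storing it in self.V):

```python
import torch
import torch.nn.functional as F

def update_buffer(V, features, labels):
    # Hypothetical helper: update the per-class feature buffer without recording
    # the operations in the graph, so the next backward cannot reach an old graph.
    with torch.no_grad():
        for m, n in zip(features, labels):
            V[n] = F.normalize((V[n] + m) / 2, p=2, dim=0)
    return V
```

Inside forward, something like self.V = update_buffer(self.V, features, labels) then keeps self.V as plain data with no history attached.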
