RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time

I have the same problem. I have read all the replies, but I can't find the right way to handle it.
The problem occurs when I add the for loop in the forward function. I hope you can help!

class ContrasiveMarginLoss(nn.Module):
    def __init__(self, num_features,num_classes,margin=0.2,model=None,dataloader=None,unselected=0):
        super(ContrasiveMarginLoss, self).__init__()
        self.margin = margin
        self.model = model

        self.register_buffer('V',torch.zeros(num_classes, num_features))
        self.V = extract_features(self.model,self.loader,self.V).to(device)
        self.V = normalize(self.V)

        self.unselected_data = unselected

        if margin is not None:
            self.ranking_loss = nn.MarginRankingLoss(margin=margin,reduction='sum')
            self.ranking_loss = nn.SoftMarginLoss()

    def forward(self,features,labels,normalize_feature=True):
        if normalize_feature:
            features = normalize(features)
        #dist,dist_max,y = ComputeDist(self.V)(features,labels)
        N = features.size(0)

        if normalize_feature:
            features = normalize(features)        #[batch_size,2048]
        dist = euclidean_dist(features,self.V)    #[16,12185]
        dist_max,y = sample_mining(dist,labels)

        V_temp = Variable(self.V)
        for m, n in zip(features,labels):
            V_temp[n] = F.normalize( (V_temp[n] + m) / 2, p=2, dim=0)
        self.V = V_temp

        loss = (1/N) * self.ranking_loss(dist,dist_max,y)

        return loss,dist,dist_max


What is your extract_features function doing? Make sure that self.V does not require gradients during your __init__; otherwise, that part of the graph will be shared by every forward that uses it.
Also, Variables don't exist anymore, so you can simply remove any use of them.


extract_features uses the pretrained model to extract the features of all the data. The code follows:

def extract_features(model=None,dataloader=None,buffer=None):
    with torch.no_grad():
        for data in dataloader:
            imgs, _, pids, indexs, _ = data
            targets =
            ide_pred, u_feat = model(imgs)
            for i, index in enumerate(targets):
                buffer[index] = u_feat[i]
    return buffer

What I want to do is save the features of all the data in self.V as a buffer, to avoid computing them repeatedly.
In the for loop, I want to use the new features to update the buffer self.V, but that is where the error occurs.
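
For what it's worth, here is a minimal sketch of one way to refresh such a buffer without tying it to the autograd graph: do the update under torch.no_grad() on a copy and assign the copy back. FeatureMemory, the squared-distance loss, and the toy encoder are hypothetical stand-ins for illustration, not the original ContrasiveMarginLoss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureMemory(nn.Module):
    """Hypothetical stand-in: keeps one normalized feature row per class."""

    def __init__(self, num_classes, num_features):
        super().__init__()
        init = F.normalize(torch.randn(num_classes, num_features), dim=1)
        self.register_buffer("V", init)

    def forward(self, features, labels):
        features = F.normalize(features, dim=1)
        # Squared Euclidean distance of every feature to every row of V.
        diff = features.unsqueeze(1) - self.V.unsqueeze(0)
        dist = diff.pow(2).sum(-1)
        loss = dist[torch.arange(features.size(0)), labels].mean()

        # Update outside of autograd so V never becomes part of a graph.
        # Work on a copy: editing V in place could invalidate tensors
        # that were saved for the backward pass.
        with torch.no_grad():
            new_V = self.V.clone()
            for feat, label in zip(features, labels):
                new_V[label] = F.normalize((new_V[label] + feat) / 2, p=2, dim=0)
        self.V = new_V
        return loss


torch.manual_seed(0)
encoder = nn.Linear(8, 4)  # toy model, assumed for the example
memory = FeatureMemory(num_classes=3, num_features=4)
for _ in range(2):
    x = torch.randn(5, 8)
    y = torch.randint(0, 3, (5,))
    loss = memory(encoder(x), y)
    loss.backward()  # no retain_graph needed on any iteration
```

Because self.V has no grad_fn after the update, the next forward starts a fresh graph instead of reaching back into the previous iteration.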

How come the following does NOT produce an error:

x = torch.ones(2, 2, requires_grad=True)
y = x + 2
y.backward(torch.ones(2, 2)) # Note I do not set retain_graph=True
y.backward(torch.ones(2, 2))


This is an edge case: the only op you do is an add, and add does not need to save any buffers for its backward, so there are no missing buffers when you call backward a second time.
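
To make the difference concrete, here is a small sketch contrasting an add with a multiply. The multiply saves its inputs for backward, so a second backward fails unless retain_graph=True was passed the first time. The variable names are only for illustration.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)

# Add: nothing is saved for backward, so calling backward twice works.
y = x + 2
y.backward(torch.ones(2, 2))
y.backward(torch.ones(2, 2))

# Multiply: the inputs are saved and freed after the first backward.
z = x * x
z.backward(torch.ones(2, 2))
try:
    z.backward(torch.ones(2, 2))
    second_backward_failed = False
except RuntimeError:
    second_backward_failed = True
print(second_backward_failed)

# With retain_graph=True the saved buffers are kept for a second pass.
w = x * x
w.backward(torch.ones(2, 2), retain_graph=True)
w.backward(torch.ones(2, 2))
```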


I found something really interesting.

This is my code:


Why is b.grad None?
And according to the chain rule, will a.grad be 0?

Check this thread: Why cant I see .grad of an intermediate variable?

Thanks for your reply!

But to optimize the params we need d(loss)/d(w) in order to change them. If we only retain the inputs' grads and drop the intermediate results (some of which are actually the params' grads), how can we optimize the model (change the params)?

As @smth mentioned in the link, "By default, gradients are only retained for leaf variables. non-leaf variables' gradients are not retained to be inspected later. This was done by design, to save memory." I think the weights and biases in a network are leaf variables, so their grads are retained (correct me if I'm wrong). In your example, you can check that b.is_leaf is False and a.is_leaf is True.
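
A quick sketch of that check, assuming a and b are related by a simple op (the original code is not shown, so the tensors here are made up). It also shows that the parameters of an nn.Linear layer are leaves and do receive a .grad.

```python
import torch
import torch.nn as nn

a = torch.ones(3, requires_grad=True)  # created directly by the user: a leaf
b = a * 2                               # produced by an op: not a leaf
b.sum().backward()

print(a.is_leaf, b.is_leaf)  # True False
print(a.grad)                # tensor([2., 2., 2.])

# Parameters of a module are leaves, so their grads are kept.
layer = nn.Linear(3, 1)
layer(torch.ones(1, 3)).sum().backward()
print(layer.weight.is_leaf)  # True
print(layer.weight.grad)     # tensor([[1., 1., 1.]])
```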

Exactly as you say!
So the params are also inputs of the model, since they are stored in leaf nodes? But we get them by initialization rather than from a dataset. Am I right?

I came across the same problem of fetching the gradient of a non-leaf node last week.

PyTorch does not keep the gradient of a non-leaf node unless you explicitly call retain_grad() on that tensor.
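
A minimal sketch of retain_grad() on an intermediate tensor (the tensors are made up for the example):

```python
import torch

a = torch.ones(3, requires_grad=True)
b = a * 2            # non-leaf
b.retain_grad()      # ask autograd to keep b.grad after backward
loss = (b * b).sum()
loss.backward()

print(b.grad)  # tensor([4., 4., 4.])  since d(loss)/db = 2 * b
print(a.grad)  # tensor([8., 8., 8.])  chain rule: 2 * b * 2
```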

Notice that a = creates a new node and makes a non-leaf.
However, model = is usually safe. Parameters usually reside in a Module object, and calling to() on a Module takes care to keep the parameters as leaf nodes by operating on
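
As an illustration of the tensor-versus-Module difference, here is a sketch that uses a dtype conversion so it runs on CPU. That choice is an assumption made for the example; the calls in the post above are cut off.

```python
import torch
import torch.nn as nn

a = torch.randn(3, requires_grad=True)
a64 = a.to(torch.float64)   # a differentiable copy: a new, non-leaf node
print(a.is_leaf)            # True
print(a64.is_leaf)          # False

model = nn.Linear(3, 1)
model = model.to(torch.float64)  # Module.to() converts the parameters
print(all(p.is_leaf for p in model.parameters()))  # True
print(model.weight.dtype)        # torch.float64
```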

By the way, you are everywhere on the forum; maimeng is shameful 😏

I really am everywhere, hahaha

But it is really hard to understand the graph

With the code

dataiter = iter(test_loader)
images, labels =

We get the output


As we know, we never compute d(loss)/d(image) or d(loss)/d(label). So do we only retain the grad of leaf nodes with requires_grad=True?


Gradient w.r.t. data is useful in generating adversarial examples, for attack/defense or domain generalization.
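
A rough sketch of what that looks like, in the style of the fast gradient sign method. The classifier, the random data, and epsilon are all hypothetical choices for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical classifier and random stand-in data.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
images = torch.rand(4, 1, 28, 28)
labels = torch.randint(0, 10, (4,))

# Ask autograd to track the input so d(loss)/d(image) is computed.
images.requires_grad_(True)
loss = F.cross_entropy(model(images), labels)
loss.backward()

epsilon = 0.1  # arbitrary step size
# Step each pixel in the direction that increases the loss.
adv_images = (images + epsilon * images.grad.sign()).clamp(0, 1).detach()
print(adv_images.shape)
```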

I've never heard of any application of the gradient w.r.t. the label. But its meaning is clear: the direction of the most dissimilar label for the example, as learned by the current model. Any reference on this would be appreciated.


With the code

dataiter = iter(test_loader)
images, labels =

output1 = Mnist_Classifier(images)
loss = loss_fn(output1, labels)

print(images.is_leaf) #1

The output is


And this is what I expect.
But if I change position 1 to print(images[0].is_leaf), the corresponding output becomes False. Why?

Slicing is also an operation that creates a new node, so a[0] creates a new node. You can print its grad_fn, which will look like <SelectBackward at 0xffffffff>.
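
A small sketch of this, assuming images requires grad (which the False result above implies). The exact grad_fn name can vary between PyTorch versions.

```python
import torch

images = torch.rand(4, 1, 28, 28, requires_grad=True)
first = images[0]      # indexing is an op recorded by autograd

print(images.is_leaf)  # True
print(first.is_leaf)   # False
print(first.grad_fn)   # a SelectBackward node
```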

But why




The results for labels and images seem different.
Do the different dimensions of images and labels make them behave differently?


Because labels[0] creates a brand new Tensor that happens to be a leaf. This newly created Tensor has not been used in any computation yet (and never will be: since you did not save it, it was destroyed right after the print). The second print creates another new leaf Tensor, which is brand new as well, so its .grad field is None, as for any newly created Tensor.


As a complement:

All Tensors with requires_grad=False are leaf Tensors by convention.
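
A short sketch of that convention, using a made-up integer label tensor:

```python
import torch

labels = torch.tensor([3, 1, 4])
print(labels.requires_grad, labels.is_leaf)  # False True

first = labels[0]
# No graph is recorded when nothing requires grad, so the result is a leaf.
print(first.is_leaf, first.grad_fn, first.grad)  # True None None

# Each indexing call returns a new Tensor object.
print(labels[0] is labels[0])  # False

# Even the result of an op is a leaf when nothing requires grad.
scaled = labels.float() * 2
print(scaled.is_leaf)  # True
```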