What does doing .backward twice do?

I was curious what happens when I call backward twice in series on a sequential computation (graph). I have one loss function in the middle and another at the end, and I call backward on both of them. The result does not seem to be equivalent to the second derivative (I checked it manually with Sympy). I wasn’t trying to compute the second derivative (though it would be nice to figure that out too), but I was curious what this is doing.

So what is it doing?

Experiment results:

---- test results ----
experiment_type = grad_constant_but_use_backward_on_loss
-- Pytorch results
g = 12.0
dJ_dw = 16838.0
-- Sympy results
g_SYMPY = 2*x*(w*x - y)
dJ_dw_SYMPY = 197258.000000000
---- test results ----
experiment_type = grad_analytic_only_backward_on_J
-- Pytorch results
g = tensor([12.], grad_fn=<MulBackward0>)
dJ_dw = 197258.0
-- Sympy results
g_SYMPY = 2*x*(w*x - y)
dJ_dw_SYMPY = 197258.000000000


    ## variable declaration
    w = torch.tensor([2.0], requires_grad=True)

    x = torch.tensor([3.0], requires_grad=False)
    y = torch.tensor([4.0], requires_grad=False)

    x2 = torch.tensor([5.0], requires_grad=False)
    y2 = torch.tensor([6.0], requires_grad=False)
    if True:
        ## computes backward pass on J (i.e. dJ_dw), but backward was already called on loss
        # compute g
        loss = (w*x-y)**2
        loss.backward() # populates w.grad with dl_dw
        g = w.grad.item() # dl_dw
        # compute w_new
        w_new = w - (g+w**2) * g
        # compute final loss J
        J = (w_new + x2 + y2)**2
        # computes derivative of J
        J.backward()
        #dw_new_dw = w_new.grad.item()
        dJ_dw = w.grad.item()

    print('---- test results ----')
    print(f'experiment_type = {experiment_type}')
    print('-- Pytorch results')
    print(f'g = {g}')
    #print(f'dw_new_dw = {dw_new_dw}')
    print(f'dJ_dw = {dJ_dw}')
    print('-- Sympy results')
    g, dw_new_dw, dJ_dw = symbolic_test(experiment_type)
    print(f'g_SYMPY = {g}')
    #print(f'dw_new_dw_SYMPY = {dw_new_dw}')
    print(f'dJ_dw_SYMPY = {dJ_dw}')

and sympy code:

from sympy import symbols, diff

def symbolic_test(experiment_type):
    w, x, y, x2, y2, g = symbols('w x y x2 y2 g')

    loss = (w*x - y)**2
    grad = diff(loss,w)
    if experiment_type == 'grad_constant':
        ## compute g
        eval_grad = grad.evalf(subs={w:2,x:3,y:4})
        g = eval_grad
        ## compute w_new
        w_new = w - (g+w**2) * g
        ## compute final loss J
        J = (w_new + x2 + y2)**2
        dw_new_dw = diff(w_new,w).evalf(subs={w:2,x:3,y:4})
        dJ_dw = diff(J,w).evalf(subs={w:2,x:3,y:4,x2:5,y2:6})
    else:
        ## include grad as a symbolic variable into next expressions
        g = grad
        ## compute w_new
        w_new = w - (g+w**2) * g
        ## compute final loss J
        J = (w_new + x2 + y2)**2
        dw_new_dw = diff(w_new,w).evalf(subs={w:2,x:3,y:4})
        dJ_dw = diff(J,w).evalf(subs={w:2,x:3,y:4,x2:5,y2:6})
    return g, dw_new_dw, dJ_dw

Interesting! It seems it doesn’t matter which of these two I use:

        g = w.grad.item() # dl_dw
        #g = w.grad # dl_dw

This, however, does seem to make a difference!

        g = w.grad # dl_dw
        g.requires_grad = True


A few points:

  • Is it expected that the experiment types you use do not match the string tested in the if statement in the sympy code?
  • Could you provide the code for the two different pytorch runs you do?
  • I would avoid using .item() in pytorch as it unpacks the content into a regular python number and thus it breaks gradient computation. If you want to have a new Tensor such that no gradients will flow back, you should use .detach().
  • Note that when you call .backward(), it accumulates into the .grad field. So here you accumulate both the gradients for loss and J into w.grad in the code sample you provided.
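
The accumulation point can be checked against the numbers reported in the question. The following is a minimal sketch, assuming the same values of w, x, y, x2, y2 as the original snippet; the helper `final_loss` and the variable names `accumulated` and `clean` are made up for illustration. It treats the first gradient as a constant (a detached copy) and compares calling backward on J with and without zeroing w.grad first.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
x, y = torch.tensor([3.0]), torch.tensor([4.0])
x2, y2 = torch.tensor([5.0]), torch.tensor([6.0])

# First backward: w.grad now holds dl_dw = 2*x*(w*x - y) = 12
loss = (w * x - y) ** 2
loss.backward()
# Detached copy, so g is a constant and is not affected by zeroing w.grad later
g = w.grad.detach().clone()

def final_loss(w, g):
    # Hypothetical helper: same w_new and J as in the question
    w_new = w - (g + w ** 2) * g
    return (w_new + x2 + y2) ** 2

# Second backward without zeroing: dJ_dw is added on top of the 12 already stored
final_loss(w, g).backward()
accumulated = w.grad.item()

# Zero the .grad field first, then backward again: only dJ_dw remains
w.grad.zero_()
final_loss(w, g).backward()
clean = w.grad.item()

print(accumulated)  # 16838.0 = 12 + 16826
print(clean)        # 16826.0
```

So the 16838.0 in the first experiment looks like dl_dw plus dJ_dw (with g held constant), not a second derivative.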


Hi AlbanD,

I was puzzled about your comment on using .item(). Why would it break gradient computations? Do you have a high-level explanation of why it does that?

Perhaps more importantly, doesn’t that mean that using python floats in torch might cause problems? (since that’s what .item() extracts) Does that mean I should wrap everything in torch variables?

Thanks in advance!

I don’t want to accumulate (which is what happens when I call backward multiple times in the same computation graph). I want to extract the gradients of the params with respect to loss and use those gradients (call them dl_dw) in later computations of J and dJ_dw. The way I’m doing it is not right: I can’t even call backward on J, because when I try to zero out the gradients before calling backward on J, pytorch halts on me due to illegal in-place operations.

However, pytorch doesn’t allow me to do this: it raises an error about in-place operations that, in my case, are not really illegal. Is there a way to extract intermediate gradients without pytorch halting without my permission?

In case you need it, see new example code I cooked up for this

import torch
from torchviz import make_dot

x = torch.ones(10, requires_grad=True)
weights = {'x':x}

y = x**2
z = x**3

l = (x-2).sum()
l.backward()  # populates x.grad
g_x = x.grad

#g_x.requires_grad = True ## Adds it to the computation graph!
print(f'g_x = x.grad = {x.grad}\n')

x.zero_()  # this is the line that raises the RuntimeError shown below

#weights['g_x'] = g_x
#print(f'weights = {weights}\n')

J = (y + z + g_x).sum()
J.backward()  # accumulates dJ_dx into x.grad
print(f'g_x = x.grad = {x.grad}\n')



RuntimeError                              Traceback (most recent call last)
<ipython-input-27-c4086df8154e> in <module>
     20 print(f'g_x = x.grad = {x.grad}\n')
---> 22 x.zero_()
     24 #weights['g_x'] = g_x

RuntimeError: a leaf Variable that requires grad has been used in an in-place operation.

Because it returns a python number and we cannot track gradients for python numbers, only torch Tensors.

Does that mean I should wrap everything in torch variables?

It will be done automatically for you whenever needed. The Tensors you create that require gradients are Tensor already.
So you don’t need to wrap anything. But you should not unwrap things either.
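
To illustrate the difference, here is a minimal sketch, assuming the same numbers as the original question. It uses `torch.autograd.grad` with `create_graph=True` so that the first gradient stays a Tensor with history; the names `g_float` and `g_const` are only for illustration. With g kept in the graph, the derivative of J matches the analytic Sympy value from the question.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
x, y = torch.tensor([3.0]), torch.tensor([4.0])
x2, y2 = torch.tensor([5.0]), torch.tensor([6.0])

loss = (w * x - y) ** 2

# create_graph=True records how the gradient itself was computed,
# so g is a Tensor that can be differentiated again
(g,) = torch.autograd.grad(loss, w, create_graph=True)

g_float = g.item()    # plain Python float: no autograd history at all
g_const = g.detach()  # still a Tensor, but cut off from the graph

# Use the differentiable g, so dJ_dw includes the dependence of g on w
w_new = w - (g + w ** 2) * g
J = (w_new + x2 + y2) ** 2
(dJ_dw,) = torch.autograd.grad(J, w)

print(type(g_float), g_const.requires_grad, g.requires_grad)
print(dJ_dw.item())  # 197258.0, the analytic value from the Sympy run
```

Replacing g with `g_float` or `g_const` in the expression for w_new would treat the gradient as a constant instead.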


This operation on x is not valid. You cannot change its value in a differentiable manner as it is a leaf…
If you just want to zero-out the content of x without the autograd knowing, you should do:

with torch.no_grad():
    x.zero_()

I was reading your post on what torch.no_grad() does and it says I won’t be able to do backward passes on my computation graph if I use that (though with some speed/memory benefits).

That is not what I am trying to do. I want things to be differentiable but I want to be able to clear out gradients of sub-graphs whenever I want.

More clearly explained: I create a forward pass, and in an earlier part of the graph I want to collect gradients so I can later use those gradients to build the whole computation graph. But the gradients are part of the graph, and I want them zeroed out since they should be treated as constants that are part of the computation graph. Thus I will call backward twice and zero out the intermediate gradients collected (but I will extract them first to use them later).

Does that make sense?
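
One possible way to get this behaviour, sketched here under the assumption that the numbers from the first post are reused: `torch.autograd.grad` returns the gradient directly instead of accumulating it into the .grad field, so there is nothing to zero out afterwards. The name `grad_field_untouched` is only for illustration, and this is not necessarily what was used in the thread.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
x, y = torch.tensor([3.0]), torch.tensor([4.0])
x2, y2 = torch.tensor([5.0]), torch.tensor([6.0])

loss = (w * x - y) ** 2

# Returns dl_dw as a new Tensor; without create_graph it carries no history,
# so it acts as a constant in everything that follows
(dl_dw,) = torch.autograd.grad(loss, w)
grad_field_untouched = w.grad is None  # nothing was accumulated

w_new = w - (dl_dw + w ** 2) * dl_dw
J = (w_new + x2 + y2) ** 2
J.backward()
dJ_dw = w.grad.item()

print(dl_dw.item(), grad_field_untouched)  # 12.0 True
print(dJ_dw)                               # 16826.0
```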

Did you mean to do x.grad.zero_()?

Oh I see. .zero_() makes all the content of a tensor zero (not just gradients). But I was trying to make the contents of a tensor that is part of a computation graph zero (with an in-place operation that is not tracked). So Pytorch was trying to protect me from it. But Pytorch was ok with x.grad being zeroed because it is not part of the computation graph so far, so it’s fine to call zero_() on it.

The (wrong) assumption I made was that .zero_() was built to make gradients zero, not tensors. So naturally, I was very confused as to why pytorch was complaining.
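
For reference, the three cases discussed above can be put side by side. This is a minimal sketch using a small leaf tensor; the flag `raised` is only for illustration.

```python
import torch

x = torch.ones(3, requires_grad=True)
(x - 2).sum().backward()  # x.grad is now a tensor of ones

# 1) In-place zeroing of a leaf that requires grad is rejected by autograd
try:
    x.zero_()
    raised = False
except RuntimeError:
    raised = True

# 2) Zeroing the accumulated gradient is fine
x.grad.zero_()

# 3) Zeroing the values of x itself is allowed when autograd is not tracking
with torch.no_grad():
    x.zero_()

print(raised, x.grad, x)
```

After step 3, x is still a leaf with requires_grad=True, so later forward and backward passes through it keep working.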

Thanks AlbanD! You’re such a boss in this forum.
