Autograd.grad accumulates gradients on sequence of Tensor making it hard to calculate Hessian matrix

tianle · May 13, 2018, 8:49pm

I encountered some unexpected behavior when using torch.autograd.grad torch.autograd — PyTorch 2.1 documentation

Here is a toy example:

x = torch.tensor([5, 3], dtype=torch.float, requires_grad=True)
y = x
print(torch.autograd.grad(outputs=y, inputs=x, grad_outputs=torch.ones_like(y), create_graph=True))

The output is correct:
(tensor([ 1., 1.]),)

However, if I tried to handle a sequence of tensors at a time, the result is unexpected:

torch.autograd.grad(outputs=[y,y], inputs=[x,x], 
                          grad_outputs=[torch.ones_like(y), torch.ones_like(y)], 
                          create_graph=True)

I would expect the result to be:
(tensor([ 1., 1.]), tensor([ 1., 1.]))
But, the actual result is:
(tensor([ 2., 2.]), tensor([ 2., 2.]))

I would assume the calculation of gradients is independent when the inputs/outputs of autograd.grad function are sequence of Tensor, but it appears not.
Here is a slightly more complicated example:

I am not sure if this is the intended behavior. Any suggestion is appreciated.

alexis-jacq · May 13, 2018, 9:49pm

Because, for a given variable, it will accumulate the computed gradients.

So if you give to autograd.grad two times y (that’s what you do by giving a list [y,y] as outputs), it understands you did two times the operation y=x, so the grad of y will sum – two times – the corresponding gradient. And it will return the gradient of each output of the list, hence two times the gradient of y, which was computed as so [1+1, 1+1].

If instead, you try:

x = torch.tensor([5, 3], dtype=torch.float, requires_grad=True)
y = x
z = x
torch.autograd.grad(outputs=[y,z], inputs=[x,x], 
                          grad_outputs=[torch.ones_like(y), torch.ones_like(y)], 
                          create_graph=True)

You obtain what you expected :

([1, 1], [1, 1])

because here no variable was accumulating two operations in the given sequence

tianle · May 14, 2018, 12:15am

Thanks for your reply @alexis-jacq .
I tried your code, but still get unexpected result on my machine (pytorch0.4):

Besides, after autograd.grad() is called, x.grad is None.
However, gradients must have been accumulated somewhere somehow during the calculation, but I don’t know how it is done (my second example in the original post is more tricky to find out how gradient is accumulated).

I know torch.autograd.backward() will accumulate gradients, should autograd.grad() also accumulate gradients? Apparently it does during the call, but the gradients won’t be save to variables.

Besides, my original goal of using autograd.grad is to calculate Hessian matrix (not a Hessian vector: pytorch/test/test_autograd.py at main · pytorch/pytorch · GitHub). Any suggestion on this is appreciated.

alexis-jacq · May 14, 2018, 9:54am

Oh sorry, I edited my post:
It’s outputs=[y,z], not outputs=[y,y] that I meant !

torch.autograd.grad(outputs=[y,z], inputs=[x,x], 
                          grad_outputs=[torch.ones_like(y), torch.ones_like(y)], 
                          create_graph=True)

So, if in the sequence you give, you have two times the same variable, the gradient for this variable will be accumulated two times. That’s as simple as this!

tianle · May 14, 2018, 11:52am

Thanks @alexis-jacq .

I tried you modified code, but still got the same result on my machine:

For this simple example, it is clear to see the gradient has been accumulated two times as you said. However, for the second example in my original post, it is not clear how it works. I am still not clear about how the computation graph is constructed when autograd.grad is called.

alexis-jacq · May 14, 2018, 1:38pm

Try print(y.grad)… Of course, x has again accumulated two gradients.

In your second example, the accumulation is more complexe, but still simple :

grad(y) = dy/dx = 2,2;        grad(z) = ones_like(y) = [1,1] # call for grad(y,x)
  |                             |
  +                             +
  |                             |
grad(y)=dy/dx*dz/dy = 40,24 ; grad(z) = dz/dy = 20,12 # call for grad(z,y)
  ||                            ||
42,26                          21,13  # call for grad([z,y],[y,x])

tianle · May 14, 2018, 4:33pm

Now I understand how gradients are accumulated, thank you very much @alexis-jacq!

I think when autograd.grad is called, the currently computation graph is copied, and then backward() is called on the copied graph. That’s why the mechanism of gradient accumulation in autograd.grad() is the same as that in backward(). The difference is that when autograd.grad() is called, computation is done on the newly created graph (setting all gradients of leaf nodes to None if there is any), and the gradients won’t be accumulated or returned on the original graph.

I found the only way to avoid accumulation of gradients using autograd.grad on a sequence of Tensor is to create separate graphs without sharing nodes. For example:

x = torch.tensor([5, 3], dtype=torch.float, requires_grad=True)
y = x
w = torch.tensor([5, 3], dtype=torch.float, requires_grad=True)
z = w
torch.autograd.grad(outputs=[y,z], inputs=[x,w], 
                          grad_outputs=[torch.ones_like(y), torch.ones_like(z)], 
                          create_graph=True)

Then the result will be
(tensor([ 1., 1.]), tensor([ 1., 1.]))

However, in order to calculate a Hessian matrix, we have to calculate second order partial derivatives separately, but there is no way to call zero_grad() during the the call of autograd.grad() on a sequence of Tensor. The only way to get around is to call autograd.grad() on one Tensor at a time, instead of on a sequence of Tensor at a time.
Please correct me if I am wrong.

Thanks!

Johnny-Wish · August 19, 2019, 1:06pm

I don’t think this will help. Because setting

y = x
z = x

is simply copying the reference of the Tensor object x to another two variable names y and z. All of the three hold reference to exactly the same object.