x = torch.tensor([5, 3], dtype=torch.float, requires_grad=True)
y = x
print(torch.autograd.grad(outputs=y, inputs=x, grad_outputs=torch.ones_like(y), create_graph=True))
The output is correct: (tensor([ 1., 1.]),)
However, if I tried to handle a sequence of tensors at a time, the result is unexpected:
I would expect the result to be: (tensor([ 1., 1.]), tensor([ 1., 1.]))
But, the actual result is: (tensor([ 2., 2.]), tensor([ 2., 2.]))
I would assume the calculation of gradients is independent when the inputs/outputs of autograd.grad function are sequence of Tensor, but it appears not.
Here is a slightly more complicated example:
Because, for a given variable, it will accumulate the computed gradients.
So if you give to autograd.grad two times y (that’s what you do by giving a list [y,y] as outputs), it understands you did two times the operation y=x, so the grad of y will sum – two times – the corresponding gradient. And it will return the gradient of each output of the list, hence two times the gradient of y, which was computed as so [1+1, 1+1].
If instead, you try:
x = torch.tensor([5, 3], dtype=torch.float, requires_grad=True)
y = x
z = x
torch.autograd.grad(outputs=[y,z], inputs=[x,x],
grad_outputs=[torch.ones_like(y), torch.ones_like(y)],
create_graph=True)
You obtain what you expected :
([1, 1], [1, 1])
because here no variable was accumulating two operations in the given sequence
Besides, after autograd.grad() is called, x.grad is None.
However, gradients must have been accumulated somewhere somehow during the calculation, but I don’t know how it is done (my second example in the original post is more tricky to find out how gradient is accumulated).
I know torch.autograd.backward() will accumulate gradients, should autograd.grad() also accumulate gradients? Apparently it does during the call, but the gradients won’t be save to variables.
So, if in the sequence you give, you have two times the same variable, the gradient for this variable will be accumulated two times. That’s as simple as this!
For this simple example, it is clear to see the gradient has been accumulated two times as you said. However, for the second example in my original post, it is not clear how it works. I am still not clear about how the computation graph is constructed when autograd.grad is called.
Now I understand how gradients are accumulated, thank you very much @alexis-jacq!
I think when autograd.grad is called, the currently computation graph is copied, and then backward() is called on the copied graph. That’s why the mechanism of gradient accumulation in autograd.grad() is the same as that in backward(). The difference is that when autograd.grad() is called, computation is done on the newly created graph (setting all gradients of leaf nodes to None if there is any), and the gradients won’t be accumulated or returned on the original graph.
I found the only way to avoid accumulation of gradients using autograd.grad on a sequence of Tensor is to create separate graphs without sharing nodes. For example:
x = torch.tensor([5, 3], dtype=torch.float, requires_grad=True)
y = x
w = torch.tensor([5, 3], dtype=torch.float, requires_grad=True)
z = w
torch.autograd.grad(outputs=[y,z], inputs=[x,w],
grad_outputs=[torch.ones_like(y), torch.ones_like(z)],
create_graph=True)
Then the result will be (tensor([ 1., 1.]), tensor([ 1., 1.]))
However, in order to calculate a Hessian matrix, we have to calculate second order partial derivatives separately, but there is no way to call zero_grad() during the the call of autograd.grad() on a sequence of Tensor. The only way to get around is to call autograd.grad() on one Tensor at a time, instead of on a sequence of Tensor at a time.
Please correct me if I am wrong.
is simply copying the reference of the Tensor object x to another two variable names y and z. All of the three hold reference to exactly the same object.