y.backward() doesn't just assign the value of y'(x) to x.grad (say y depends on x). It actually adds y'(x) to the current value of x.grad (think of it as x.grad += true_gradient).
In the following example, y.backward() is called 5 times, so the final value of x.grad will be 5*cos(0) = 5.
import torch
from torch.autograd import Variable

x = Variable(torch.Tensor([[0]]), requires_grad=True)
for t in range(5):
    y = x.sin()
    y.backward()   # each call adds cos(0) = 1 to x.grad
print(x.grad)      # shows 5
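If you want to watch the accumulation happen step by step, you can print x.grad inside the loop instead; each backward() adds cos(0) = 1, so the printed values go 1, 2, 3, 4, 5 (a minimal sketch, same setup as above):

import torch
from torch.autograd import Variable

x = Variable(torch.Tensor([[0]]), requires_grad=True)
for t in range(5):
    y = x.sin()
    y.backward()
    print(x.grad)   # prints 1, then 2, 3, 4, 5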
Calling x.grad.data.zero_() before y.backward() makes sure that x.grad holds exactly the current y'(x), not the sum of y'(x) over all previous iterations.
x = Variable(torch.Tensor([[0]]), requires_grad=True)
for t in range(5):
    if x.grad is not None:
        x.grad.data.zero_()   # clear the accumulated gradient first
    y = x.sin()
    y.backward()
print(x.grad)                 # shows 1
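For completeness, here is the same zeroing pattern written as a sketch against the newer tensor API (this assumes PyTorch 0.4 or later, where Variable is no longer needed and you can call x.grad.zero_() directly):

import torch

x = torch.zeros(1, 1, requires_grad=True)
for t in range(5):
    if x.grad is not None:
        x.grad.zero_()   # clear the accumulated gradient before each backward
    y = x.sin()
    y.backward()
print(x.grad)            # tensor([[1.]])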
I also got confused by this “zeroing gradient” business when first learning PyTorch. The documentation for torch.autograd.backward does mention that
This function accumulates gradients in the leaves - you might need to zero them before calling it.
But this is quite hard to find and pretty confusing for (say) TensorFlow users. Official tutorials like the 60 Minute Blitz or PyTorch with Examples both say nothing about why one needs to call grad.data.zero_() during training. I think it would be useful to explain this a little more in beginner-level tutorials. RNNs are a good example of why accumulating gradients (instead of overwriting them) is useful, but I suspect new users wouldn't even know that backward() accumulates gradients in the first place.
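To make the “why accumulation is useful” point concrete, here is a rough sketch (the model, data, and sizes are made up for illustration, not taken from any tutorial) of a training loop that sums gradients over several small batches before a single optimizer step; optimizer.zero_grad() plays the same role as grad.data.zero_() above:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulation_steps = 4

optimizer.zero_grad()                       # start with clean gradients
for step in range(accumulation_steps):
    inputs = torch.randn(8, 10)             # fake micro-batch
    targets = torch.randn(8, 1)
    loss = nn.functional.mse_loss(model(inputs), targets)
    (loss / accumulation_steps).backward()  # gradients keep adding up across calls
optimizer.step()                            # one update from the summed gradients
optimizer.zero_grad()                       # reset before the next accumulation cycle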