Why do we need to set the gradients manually to zero in PyTorch?

y.backward() doesn’t just assign the value of y’(x) to x.grad (assuming y depends on x). It actually adds y’(x) to the current value of x.grad (think of it as x.grad += true_gradient).

In the following example, y.backward() is called 5 times. Since y = sin(x) and dy/dx = cos(x), each call adds cos(0) = 1 to x.grad, so the final value of x.grad is 5*cos(0) = 5.

import torch
from torch.autograd import Variable

x = Variable(torch.Tensor([[0]]), requires_grad=True)

for t in range(5):
    y = x.sin()
    y.backward()  # each call adds dy/dx = cos(0) = 1 to x.grad
    
print(x.grad) # shows 5

Calling x.grad.data.zero_() before y.backward() ensures that x.grad holds exactly the current y’(x), rather than the sum of y’(x) over all previous iterations.

x = Variable(torch.Tensor([[0]]), requires_grad=True) 

for t in range(5):
    if x.grad is not None:
        x.grad.data.zero_()  # reset the accumulated gradient before this backward pass
    y = x.sin()
    y.backward()

print(x.grad) # shows 1
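
For anyone reading this with a newer PyTorch (where Variable has been merged into Tensor), here is a minimal sketch of the same experiment using plain tensors and x.grad.zero_(); this assumes a recent release, and the behaviour is the same:

import torch

x = torch.tensor([[0.0]], requires_grad=True)

for t in range(5):
    if x.grad is not None:
        x.grad.zero_()  # reset the accumulated gradient in place
    y = x.sin()
    y.backward()

print(x.grad)  # tensor([[1.]])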

I also got confused by this “zeroing gradients” step when first learning PyTorch. The documentation of torch.autograd.backward does mention that

This function accumulates gradients in the leaves - you might need to zero them before calling it.

But this is quite hard to find and pretty confusing for (say) TensorFlow users.

Official tutorials like the 60 Minute Blitz or PyTorch with Examples say nothing about why one needs to call grad.data.zero_() during training. I think it would be useful to explain this a little more in beginner-level tutorials. RNNs are a good example of why accumulating gradients (instead of overwriting them) is useful, but I guess new users wouldn’t even know that backward() accumulates gradients :sweat_smile:
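
For completeness, here is a sketch of the pattern most training loops end up using. The model, loss, and data below are hypothetical placeholders just to illustrate it; the point is that optimizer.zero_grad() clears the accumulated gradients between updates, and deliberately not clearing them lets you accumulate gradients over several small batches before taking a single step:

import torch
import torch.nn as nn

# Hypothetical model, loss, and data, only to illustrate the pattern.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
data = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(8)]

accumulation_steps = 2  # take an optimizer step every 2 batches

for i, (inputs, targets) in enumerate(data):
    loss = loss_fn(model(inputs), targets)
    loss.backward()                      # gradients are *added* to .grad

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                 # update using the accumulated gradients
        optimizer.zero_grad()            # clear .grad for the next accumulation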
