When is the better time to zero the gradients?

Hi,

The second one is most likely a bug.
Remember that .backward() accumulates gradients into .grad rather than overwriting them, so zero_grad() should be called in a place where it clears the old gradients before the new ones are accumulated.
The second example here actually zeroes out the gradients just before they are used, so the step won’t do much: it will always be given 0 gradients (the parameters may still move due to regularization and momentum terms, though).
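
To make both points concrete, here is a minimal sketch (not the code from the original question; the tensors `w` and `p`, the toy loss `(2 * x).sum()`, and the learning rate are made up for illustration). It assumes plain SGD with no momentum or weight decay, so nothing other than the gradient can move the parameter. Depending on the PyTorch version, zero_grad() either fills the gradients with zeros or sets them to None; in both cases the step below leaves the parameter where it was.

```python
import torch

# Part 1: backward() adds to .grad instead of overwriting it.
w = torch.tensor([1.0, 2.0], requires_grad=True)
(2 * w).sum().backward()
first = w.grad.clone()          # gradient from a single pass
(2 * w).sum().backward()        # no zeroing in between
accumulated = w.grad.clone()    # old gradient + new gradient

# Part 2: zeroing between backward() and step() throws the update away.
p = torch.nn.Parameter(torch.tensor([1.0, 2.0]))
optimizer = torch.optim.SGD([p], lr=0.1)  # plain SGD: no momentum, no weight decay
before = p.detach().clone()
(2 * p).sum().backward()
optimizer.zero_grad()           # bug: gradients are cleared before they are used
optimizer.step()
unchanged = torch.equal(before, p.detach())

print(first, accumulated, unchanged)
```

Here `accumulated` is twice `first`, and `unchanged` is True because the optimizer never sees a non-zero gradient.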

In general, I would advise calling zero_grad() before backward() instead of after .step(), so that you are sure the gradients are always cleared (especially in the first iteration of your loop).
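
A loop with that ordering could look roughly like this. It is only a sketch: the model, the random data, the loss function, and the hyperparameters are all placeholders I picked for the example, not anything from the original question.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Placeholder model and data, chosen only to make the loop runnable.
model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
inputs = torch.randn(8, 3)
targets = torch.randn(8, 1)

losses = []
for _ in range(5):
    optimizer.zero_grad()                   # clear old gradients first
    loss = loss_fn(model(inputs), targets)  # forward pass
    loss.backward()                         # accumulate fresh gradients
    optimizer.step()                        # update using those gradients
    losses.append(loss.item())

print(losses)
```

With zero_grad() at the top, the first iteration starts from clean gradients even if something earlier in the script already called backward() on the model.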
