What's the purpose of designing optimizer.zero_grad()

I don’t understand: since we always call optimizer.zero_grad() to clear the gradients stored in the parameters, why does PyTorch keep the previous gradients at all?
Is there any other situation where the previous gradients are actually used?

You could simulate a larger batch size by accumulating the gradients over a few smaller batches: run a forward pass and call backward() for each of them, and only call optimizer.step() (and zero_grad()) once the gradients have been accumulated. A sketch of this is shown below.
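Here is a minimal sketch of that accumulation pattern; the model, data, and accumulation_steps are just placeholders for illustration:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accumulation_steps = 4  # effective batch size = 4 x the loader's batch size
data_loader = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(8)]

optimizer.zero_grad()
for i, (inputs, targets) in enumerate(data_loader):
    loss = criterion(model(inputs), targets)
    # Scale the loss so the accumulated gradient matches one large batch
    (loss / accumulation_steps).backward()  # gradients are added into .grad

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # update using the accumulated gradients
        optimizer.zero_grad()  # clear them before the next accumulation window
```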
Also, in the DCGAN example the gradients from the “real” and “fake” losses are accumulated, and optimizer.step() is called only after both backward passes have run.
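A simplified sketch of that discriminator update (netD, netG, and the tensor shapes here are placeholders, not the actual example code):

```python
import torch
import torch.nn as nn

netD = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())
netG = nn.Sequential(nn.Linear(16, 64))
criterion = nn.BCELoss()
optimizerD = torch.optim.Adam(netD.parameters(), lr=2e-4)

real = torch.randn(32, 64)
noise = torch.randn(32, 16)

optimizerD.zero_grad()

# Backward pass for the real batch: writes gradients into .grad
loss_real = criterion(netD(real).view(-1), torch.ones(32))
loss_real.backward()

# Backward pass for the fake batch: gradients are added on top of the real ones
fake = netG(noise)
loss_fake = criterion(netD(fake.detach()).view(-1), torch.zeros(32))
loss_fake.backward()

optimizerD.step()  # single update with the summed gradients from both losses
```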

It gives you more flexibility, if you would like to experiment with some crazy stuff! :wink:
