Running SGD optimizer without optimizer.zero_grad()

Hello, I am performing image classification using the SGD optimizer. I accidentally trained without calling optimizer.zero_grad(), but I got better results than before. Can you guess why?

optimizer.zero_grad() resets the gradients of the parameters to zero. If you don't call it, then each call to .backward() on your loss adds the current gradients (w.r.t. the current loss) to the gradients already stored on the parameters (w.r.t. the previous losses), because PyTorch accumulates gradients in .grad rather than overwriting them.

So basically you're updating your model parameters with large gradients, since the gradients keep accumulating as you move on to the next batch.
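To see the accumulation concretely, here is a minimal sketch with a toy scalar parameter (not your model), showing how .grad grows across backward() calls when zero_grad() is never called:

import torch

w = torch.ones(1, requires_grad=True)   # toy parameter, just for illustration

loss = (2 * w).sum()
loss.backward()
print(w.grad)   # tensor([2.])

# without zeroing, the next backward() adds to the stored gradient
loss = (2 * w).sum()
loss.backward()
print(w.grad)   # tensor([4.]) -- accumulated, not replaced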

Now in the case of SGD, which is a simple update rule without adaptive scaling, this might work: SGD is already pretty slow, so not calling optimizer.zero_grad() gives you larger (accumulated) gradients, which can act like a bigger effective step size, hence SGD might work better in this case. My guess is that your model is also quite simple (unlike, say, ResNet-20).

But if your model is large, or you're using other gradient descent variants like Adam, NAdam, RMSprop, etc., then this approach will likely blow up and give NaN values.
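For reference, the usual per-batch loop zeroes the gradients once per iteration. A minimal sketch, assuming model, loader, and criterion are already defined on your side:

import torch

# model, loader, and criterion are assumed to exist already
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for inputs, targets in loader:
    optimizer.zero_grad()                      # reset accumulated gradients
    loss = criterion(model(inputs), targets)
    loss.backward()                            # fresh gradients for this batch
    optimizer.step()                           # parameter update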

Thank you very much for your reply. I understood it well.

For visual inspection and understanding, I am currently plotting a gradient histogram with TensorBoard. However, I find it hard to interpret because I don't have much prior knowledge.

Is there any way to plot the gradients quantitatively? Thank you.

You can access each parameter's gradient via param.grad:

grad_norms = []
for param in model.parameters():
    if param.grad is not None:   # gradients exist only after backward()
        grad_norms.append(param.grad.abs().sum().item())

Then you can plot these gradient norms over training.
If you call optimizer.zero_grad() after optimizer.step(), the curve should trend downward, since the loss is converging.
But if you only call optimizer.step(), the curve should not decrease; in fact it should grow, since the gradients are added together at every update.
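Since you mentioned TensorBoard, you can also log these norms directly as scalars and get a quantitative curve per parameter. A minimal sketch, assuming model and a global step counter step already exist in your training loop:

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()   # logs to ./runs by default

# inside the training loop, after loss.backward():
for name, param in model.named_parameters():
    if param.grad is not None:
        writer.add_scalar(f"grad_l1/{name}", param.grad.abs().sum().item(), step)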

Thank you for your help.
It helped me a lot.