SGD optimizer with momentum and optimizer.zero_grad()

I’m using the SGD optimizer with momentum=0.9. Does the momentum still work if I call optimizer.zero_grad() for each batch?
If we use momentum, we need to keep the gradients from the previous step, but optimizer.zero_grad() will clear the gradients from the previous step. So I guess the momentum is useless here. Am I right?
For each batch in the training phase, the code looks like the following:

optimizer.zero_grad()   # clear the gradients from the previous batch
loss.backward()         # compute gradients for the current batch
optimizer.step()        # update parameters (and the momentum buffer)

The SGD optimizer stores the momentum buffer separately from the gradients, inside the optimizer's own state.
Running zero_grad() only clears the .grad attributes of the parameters; it does not wipe the stored momentum.
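
On each optimizer.step(), SGD with momentum updates that internal buffer (roughly buf = momentum * buf + grad) and uses the buffer for the parameter update, so clearing .grad between batches is exactly what you want. Here is a minimal sketch to convince yourself, assuming a recent PyTorch where zero_grad() sets gradients to None by default (the toy parameter and numbers are just for illustration):

import torch

# A single toy parameter and SGD with momentum.
param = torch.nn.Parameter(torch.randn(3))
optimizer = torch.optim.SGD([param], lr=0.1, momentum=0.9)

# One backward/step so the momentum buffer gets created.
loss = (param ** 2).sum()
loss.backward()
optimizer.step()

buf_before = optimizer.state[param]['momentum_buffer'].clone()
optimizer.zero_grad()   # clears param.grad (None on recent versions, zeros on older ones)
buf_after = optimizer.state[param]['momentum_buffer']

print(param.grad)                           # None (or zeros on older PyTorch)
print(torch.equal(buf_before, buf_after))   # True: the momentum buffer survives zero_grad()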
