SGD optimizer with momentum and optimizer.zero_grad()

I’m using the SGD optimizer with momentum=0.9. Does the momentum still work if I call optimizer.zero_grad() for each batch?
If we use momentum, we need to keep the gradients from the previous step, but optimizer.zero_grad() will clear the gradients from the previous step. So I guess the momentum is useless here. Am I right?
For each batch in the training phase, the code looks like the following:

optimizer.zero_grad()   # clear the gradients from the previous batch
loss.backward()         # compute gradients for the current batch
optimizer.step()        # update parameters (and the momentum buffer)

The SGD optimizer stores the momentum buffer separately from the gradients, inside the optimizer's own state.
Running zero_grad() only clears the .grad attributes of the parameters; it does not wipe the stored momentum.
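
On each optimizer.step(), SGD with momentum updates that internal buffer (roughly buf = momentum * buf + grad) and uses the buffer for the parameter update, so clearing .grad between batches is exactly what you want. Here is a minimal sketch to convince yourself, assuming a recent PyTorch where zero_grad() sets gradients to None by default (the toy parameter and numbers are just for illustration):

import torch

# A single toy parameter and SGD with momentum.
param = torch.nn.Parameter(torch.randn(3))
optimizer = torch.optim.SGD([param], lr=0.1, momentum=0.9)

# One backward/step so the momentum buffer gets created.
loss = (param ** 2).sum()
loss.backward()
optimizer.step()

buf_before = optimizer.state[param]['momentum_buffer'].clone()
optimizer.zero_grad()   # clears param.grad (None on recent versions, zeros on older ones)
buf_after = optimizer.state[param]['momentum_buffer']

print(param.grad)                           # None (or zeros on older PyTorch)
print(torch.equal(buf_before, buf_after))   # True: the momentum buffer survives zero_grad()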
