I’m interested in using SGD with momentum. Looking at the pseudocode in the SGD documentation (SGD — PyTorch 2.6 documentation), I’m confused as to why the momentum parameter (momentum > 0) never touches the gradient when nesterov=False and dampening >= 1.
Below is a screenshot of the pseudocode, where the red arrows indicate (to my understanding) which lines would be executed under these settings.
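For concreteness, the optimizer is constructed along these lines (the model, lr, and momentum values here are placeholders; the point is momentum > 0 with nesterov left at its default of False):

```python
import torch

model = torch.nn.Linear(4, 2)  # placeholder model

# Placeholder hyperparameters; what matters is momentum > 0 and
# nesterov=False (the default). dampening also keeps its default.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
```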
Based on the pseudocode, it seems like momentum is not used at all under these parameter settings. Am I missing something, or is the pseudocode incorrect?
The pseudocode looks sensible to me and is not obviously incorrect. (I don’t know whether it agrees in detail with PyTorch’s implementation.)
As I read it, on the first iteration (t = 1) the momentum, mu, is indeed not used. But
this makes sense because mu is basically a moving-average parameter. On the first
iteration, the “moving-average gradient” is just set equal to the gradient, because there
is not yet a previous value with which to average it.
On subsequent iterations (t > 1), we have b_t = mu * b_{t-1} + (1 - tau) * g_t. Here mu is used: it specifies how much of the previous value of the buffer, b_{t-1}, is mixed into the new value, b_t.
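To make the branch structure concrete, here is a minimal sketch of a single update following that pseudocode (weight decay and maximize omitted; plain tensor arithmetic, not PyTorch’s actual implementation):

```python
import torch

def sgd_momentum_step(param, grad, buf, lr, mu, tau, t, nesterov=False):
    """One SGD step following the documentation's pseudocode (sketch only)."""
    if mu != 0:
        if t > 1:
            # mu mixes the previous buffer b_{t-1} into the new one
            buf = mu * buf + (1 - tau) * grad
        else:
            # t = 1: no previous buffer yet, so b_1 = g_1 and mu is unused
            buf = grad.clone()
        if nesterov:
            grad = grad + mu * buf
        else:
            grad = buf
    return param - lr * grad, buf
```

Note that mu does nothing on the first call (t = 1) but shapes every call after that.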
Is it possible that you are misreading tau, the dampening parameter, for t, the iteration
index used in the pseudocode?
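In any case, a quick numerical trace (made-up numbers: a constant gradient of 1.0 and the default dampening of 0) shows mu first entering the buffer at t = 2:

```python
import torch

g = torch.tensor([1.0])      # pretend the gradient is 1.0 at every step
mu, tau = 0.9, 0.0           # momentum and dampening (dampening defaults to 0)

b = g.clone()                # t = 1: b_1 = g_1 -- mu not used yet
print(b)                     # tensor([1.])

b = mu * b + (1 - tau) * g   # t = 2: b_2 = mu * b_1 + (1 - tau) * g_2
print(b)                     # tensor([1.9000])
```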