SGD with momentum pseudocode error?

Hello,

I’m interested in using SGD with momentum. Looking at the pseudocode on the SGD documentation page (SGD — PyTorch 2.6 documentation), I’m confused as to why the momentum parameter (momentum>0) never touches the gradient when nesterov=False and dampening>=1.

Below is a screenshot of the pseudocode where the red arrows indicate (to my understanding) which lines would be executed if you define SGD as:

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

Based on the pseudocode, it seems like momentum is not used at all under these parameter settings. Am I missing something or is the pseudocode incorrect?

Hi Belsten!

The pseudocode looks sensible to me and is not obviously incorrect. (I don’t know
whether it agrees in detail with pytorch’s implementation.)

As I read it, on the first iteration (t = 1) the momentum, mu, is indeed not used. But
this makes sense because mu is basically a moving-average parameter. On the first
iteration, the “moving-average gradient” is just set equal to the gradient, because there
is not yet a previous value with which to average it.

On subsequent iterations (t > 1), we have b_t = mu * b_{t-1} + (1 - tau) * g_t. Here mu is
used, and it specifies how much of the previous value of b_t is mixed into
the new value of b_t.
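
To make the t = 1 versus t > 1 distinction concrete, here is a minimal python sketch of
how I read that pseudocode (my reading only, not necessarily pytorch's actual
implementation):

import torch

def sgd_momentum_step(param, grad, buf, lr=0.1, momentum=0.9,
                      dampening=0.0, nesterov=False):
    # one parameter update, following the documented pseudocode
    if momentum != 0:
        if buf is None:
            # first iteration (t = 1): the buffer is just the gradient
            buf = grad.clone()
        else:
            # later iterations (t > 1): mu mixes the previous buffer
            # into the new one
            buf = momentum * buf + (1 - dampening) * grad
        if nesterov:
            grad = grad + momentum * buf
        else:
            grad = buf
    return param - lr * grad, buf

p = torch.tensor([1.0])
buf = None
for t in range(1, 4):
    g = torch.tensor([1.0])   # pretend the gradient is a constant 1.0
    p, buf = sgd_momentum_step(p, g, buf)
    print(t, p.item(), buf.item())

With a constant gradient of 1.0 the buffer goes 1.0, 1.9, 2.71, ..., so mu clearly
starts acting on the second step.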

Is it possible that you are misreading tau, the dampening parameter, for t, the iteration
index used in the pseudocode?
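
If you want to double-check against pytorch itself (rather than against my reading of
the pseudocode), you can step the real optimizer with a constant gradient and look at
the momentum buffer it keeps in the optimizer state (stored under the key
"momentum_buffer", if I remember the implementation correctly):

import torch

p = torch.nn.Parameter(torch.tensor([1.0]))
opt = torch.optim.SGD([p], lr=0.1, momentum=0.9)

for t in range(1, 4):
    p.grad = torch.ones_like(p)   # force g_t = 1.0 every step
    opt.step()
    buf = opt.state[p]["momentum_buffer"]
    print(t, p.item(), buf.item())

You should see the same 1.0, 1.9, 2.71, ... buffer values as in the sketch above, with
mu = 0.9 kicking in from the second step onward.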

Best.

K. Frank