I’m interested in using SGD with momentum. Looking at the pseudocode in the SGD documentation (SGD — PyTorch 2.6 documentation), I’m confused as to why the momentum parameter (momentum > 0) never touches the gradient when nesterov=False and dampening >= 1.
Below is a screenshot of the pseudocode, where the red arrows indicate (to my understanding) which lines would be executed under these settings.
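For concreteness, the optimizer is constructed along these lines (the model, lr, and momentum values here are placeholders; the point is momentum > 0 with nesterov left at its default of False):

```python
import torch

model = torch.nn.Linear(4, 2)  # placeholder model

# Placeholder hyperparameters; what matters is momentum > 0 and
# nesterov=False (the default). dampening also keeps its default.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
```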
Based on the pseudocode, it seems like momentum is not used at all under these parameter settings. Am I missing something, or is the pseudocode incorrect?
The pseudocode looks sensible to me and is not obviously incorrect. (I don’t know whether it agrees in detail with PyTorch’s implementation.)
As I read it, on the first iteration (t = 1) the momentum, mu, is indeed not used. But
this makes sense because mu is basically a moving-average parameter. On the first
iteration, the “moving-average gradient” is just set equal to the gradient, because there
is not yet a previous value with which to average it.
On subsequent iterations (t > 1), we have b_t = mu * b_{t-1} + (1 - tau) * g_t. Here mu is used: it specifies how much of the previous value of the buffer, b_{t-1}, is mixed into the new value, b_t.
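To make the branch structure concrete, here is a minimal sketch of a single update following that pseudocode (weight decay and maximize omitted; plain tensor arithmetic, not PyTorch’s actual implementation):

```python
import torch

def sgd_momentum_step(param, grad, buf, lr, mu, tau, t, nesterov=False):
    """One SGD step following the documentation's pseudocode (sketch only)."""
    if mu != 0:
        if t > 1:
            # mu mixes the previous buffer b_{t-1} into the new one
            buf = mu * buf + (1 - tau) * grad
        else:
            # t = 1: no previous buffer yet, so b_1 = g_1 and mu is unused
            buf = grad.clone()
        if nesterov:
            grad = grad + mu * buf
        else:
            grad = buf
    return param - lr * grad, buf
```

Note that mu does nothing on the first call (t = 1) but shapes every call after that.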
Is it possible that you are misreading tau, the dampening parameter, for t, the iteration
index used in the pseudocode?
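In any case, a quick numerical trace (made-up numbers: a constant gradient of 1.0 and the default dampening of 0) shows mu first entering the buffer at t = 2:

```python
import torch

g = torch.tensor([1.0])      # pretend the gradient is 1.0 at every step
mu, tau = 0.9, 0.0           # momentum and dampening (dampening defaults to 0)

b = g.clone()                # t = 1: b_1 = g_1 -- mu not used yet
print(b)                     # tensor([1.])

b = mu * b + (1 - tau) * g   # t = 2: b_2 = mu * b_1 + (1 - tau) * g_2
print(b)                     # tensor([1.9000])
```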