Learning rate applied twice for momentum when using SGD

Hi all,

Do I read the source code correctly that momentum in SGD is applied as

v = momentum * v + (1-dampening) * gradientW
W = W - lr * v

instead of the classical formulation below?

v = momentum * v + (1-dampening) * lr * gradientW
W = W - v

Wouldn’t that result in applying the learning-rate scaling twice to past gradients? Is this intended, or a bug?

This is intentional; see the note at https://github.com/pytorch/pytorch/blob/master/torch/optim/sgd.py#L28-L48. The two formulations are equivalent for a constant learning rate, and the one used in PyTorch makes the step size directly proportional to the current learning rate when the learning rate changes.
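
To make that concrete, here is a minimal numerical sketch (not the actual torch.optim.SGD code) of both update rules on a single scalar weight, assuming dampening = 0 and a made-up gradient sequence and learning-rate schedule:

```python
momentum = 0.9
grads = [0.5, -0.3, 0.8, 0.1]           # hypothetical gradients
constant_lr = [0.1, 0.1, 0.1, 0.1]
changing_lr = [0.1, 0.1, 0.01, 0.01]    # learning rate dropped after two steps

def pytorch_style(lrs, grads, momentum):
    # v = momentum * v + grad;  W = W - lr * v
    w, v = 1.0, 0.0
    for lr, g in zip(lrs, grads):
        v = momentum * v + g
        w -= lr * v
    return w

def classic_style(lrs, grads, momentum):
    # v = momentum * v + lr * grad;  W = W - v
    w, v = 1.0, 0.0
    for lr, g in zip(lrs, grads):
        v = momentum * v + lr * g
        w -= v
    return w

# Constant learning rate: both rules give the same final weight.
print(pytorch_style(constant_lr, grads, momentum),
      classic_style(constant_lr, grads, momentum))

# Changing learning rate: the PyTorch-style rule rescales the whole step
# (including the accumulated buffer) by the new lr, while the classic rule
# keeps past contributions at the old lr, so the results differ.
print(pytorch_style(changing_lr, grads, momentum),
      classic_style(changing_lr, grads, momentum))
```

With the constant schedule the two printouts match; with the decayed schedule they diverge, which is the "step directly proportional to learning rate" behaviour the note describes.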

Ah thanks, my bad, should have scrolled to the documentation note :stuck_out_tongue: