Learning rate applied twice for momentum when using SGD

Hi all,

Do I read the source code correctly that momentum in SGD is applied as

v = momentum * v + (1-dampening) * gradientW
W = W - lr * v

instead of the classical formulation below?

v = momentum * v + (1-dampening) * lr * gradientW
W = W - v

Wouldn’t that result in applying the learning-rate scaling twice to past gradients? Is this intended, or a bug?

This is intentional; see the note at https://github.com/pytorch/pytorch/blob/master/torch/optim/sgd.py#L28-L48. The two formulations are equivalent for a constant learning rate, and the one used in PyTorch makes the step size directly proportional to the current learning rate when the learning rate changes.
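
To make that concrete, here is a minimal numerical sketch (not the actual torch.optim.SGD code) of both update rules on a single scalar weight, assuming dampening = 0 and a made-up gradient sequence and learning-rate schedule:

```python
momentum = 0.9
grads = [0.5, -0.3, 0.8, 0.1]           # hypothetical gradients
constant_lr = [0.1, 0.1, 0.1, 0.1]
changing_lr = [0.1, 0.1, 0.01, 0.01]    # learning rate dropped after two steps

def pytorch_style(lrs, grads, momentum):
    # v = momentum * v + grad;  W = W - lr * v
    w, v = 1.0, 0.0
    for lr, g in zip(lrs, grads):
        v = momentum * v + g
        w -= lr * v
    return w

def classic_style(lrs, grads, momentum):
    # v = momentum * v + lr * grad;  W = W - v
    w, v = 1.0, 0.0
    for lr, g in zip(lrs, grads):
        v = momentum * v + lr * g
        w -= v
    return w

# Constant learning rate: both rules give the same final weight.
print(pytorch_style(constant_lr, grads, momentum),
      classic_style(constant_lr, grads, momentum))

# Changing learning rate: the PyTorch-style rule rescales the whole step
# (including the accumulated buffer) by the new lr, while the classic rule
# keeps past contributions at the old lr, so the results differ.
print(pytorch_style(changing_lr, grads, momentum),
      classic_style(changing_lr, grads, momentum))
```

With the constant schedule the two printouts match; with the decayed schedule they diverge, which is the "step directly proportional to learning rate" behaviour the note describes.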

Ah thanks, my bad, should have scrolled to the documentation note :stuck_out_tongue: