rasbt
(Sebastian Raschka)
1
Hi all,
Do I see it correctly in the source code that momentum in SGD is applied as
v = momentum * v + (1-dampening) * gradientW
W = W - lr * v
instead of the original formulation?
v = momentum * v + (1-dampening) * lr * gradientW
W = W - v
Wouldn’t that result in applying the learning-rate scaling twice to the past gradients accumulated in v? Is this intended, or is it a bug?
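For concreteness, here are the two variants written out as plain Python (just a sketch mirroring the formulas above, not PyTorch's actual implementation; names follow the formulas):

def pytorch_style_step(W, v, grad, lr, momentum, dampening=0.0):
    # v = momentum * v + (1 - dampening) * gradientW
    v = momentum * v + (1 - dampening) * grad
    # W = W - lr * v  (lr applied to the whole velocity at step time)
    return W - lr * v, v

def classical_step(W, v, grad, lr, momentum, dampening=0.0):
    # v = momentum * v + (1 - dampening) * lr * gradientW
    v = momentum * v + (1 - dampening) * lr * grad
    # W = W - v  (lr folded into the buffer at accumulation time)
    return W - v, v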
ngimel
(ngimel)
2
This is intentional, see the note at https://github.com/pytorch/pytorch/blob/master/torch/optim/sgd.py#L28-L48. The two formulations are equivalent for a constant learning rate, and the one used in PyTorch makes the step directly proportional to the learning rate when the learning rate changes.
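A quick standalone check of both claims (a sketch with a made-up gradient sequence, not the optimizer source; momentum 0.9, dampening 0):

def run(pytorch_style, grads, lrs, momentum=0.9):
    W, v = 1.0, 0.0
    for g, lr in zip(grads, lrs):
        if pytorch_style:
            v = momentum * v + g       # buffer holds raw gradients
            W -= lr * v                # step is exactly lr * v
        else:
            v = momentum * v + lr * g  # lr baked into the buffer
            W -= v
    return W

grads = [0.5, -0.2, 0.3, 0.1]

# Constant learning rate: identical final weights.
print(run(True, grads, [0.1] * 4))   # same value as the next line
print(run(False, grads, [0.1] * 4))

# Learning rate dropped mid-run: the PyTorch-style step shrinks immediately
# (proportional to the current lr), while the classical buffer still carries
# contributions scaled by the old lr.
print(run(True, grads, [0.1, 0.1, 0.01, 0.01]))
print(run(False, grads, [0.1, 0.1, 0.01, 0.01]))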
rasbt
(Sebastian Raschka)
3
Ah, thanks, my bad. I should have scrolled down to the documentation note.