SGD with momentum - why the formula change?

The background is that while the two formulas are equivalent in the case of a fixed learning rate, they differ in how changing the learning rate (e.g. in a lr schedule) behaves.
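For reference, using the notation from the torch.optim.SGD docs (p: parameters, g: gradient, v: velocity, μ: momentum), the original formula from Sutskever et al. applies the lr inside the momentum buffer,

```
v_{t+1} = μ * v_t + lr * g_{t+1}
p_{t+1} = p_t - v_{t+1}
```

while the modified formula that PyTorch implements applies it outside, only in the parameter update:

```
v_{t+1} = μ * v_t + g_{t+1}
p_{t+1} = p_t - lr * v_{t+1}
```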

With given gradient magnitudes, lowering the learning rate behaves as follows:

  • in the original formula, it reduces the magnitude of the new contributions to the momentum buffer, so the parameter updates only become smaller gradually as the old momentum decays, while
  • in the modified formula, the momentum buffer stays the same and the parameter updates become smaller immediately.

In other words, with the modified formula a change of the learning rate can be thought of as also being applied to the existing momentum buffer at the time of the change. This turns out to be more intuitive when working with lr schedules; the sketch below illustrates it.
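To make this concrete, here is a minimal sketch (plain Python, with a constant gradient of 1.0 and a made-up 10x lr drop at step 5, so the numbers are purely illustrative) that prints the per-step parameter update under both formulas:

```python
mu, g = 0.9, 1.0  # momentum and (constant) gradient magnitude

v_orig = v_mod = 0.0
for step in range(10):
    lr = 0.1 if step < 5 else 0.01  # schedule: drop the lr 10x at step 5

    # original (Sutskever et al.) formula: lr enters the momentum buffer
    v_orig = mu * v_orig + lr * g
    update_orig = v_orig

    # modified (PyTorch) formula: lr is applied outside the buffer
    v_mod = mu * v_mod + g
    update_mod = lr * v_mod

    print(f"step {step}: original {update_orig:.4f}  modified {update_mod:.4f}")
```

Up to step 4 the two print identical updates. From step 5 on, the modified formula's update is 10x smaller right away, while the original formula's update only decays towards the new level over the following steps.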

Relative to the wording in the documentation (which says that some other frameworks use the original formula), I think that more recently, other frameworks have also moved to the new formula.

Best regards

Thomas

P.S.: I once sat in a talk where they described porting from Torch7 (which also applied the lr like PyTorch does) to a framework that uses the other update rule, and how they spent weeks, on and off, debugging why the network would not train well with the exact same training parameters. It turned out to be this discrepancy between the momentum formulas.
