The documentation for SGD with momentum emphasizes that PyTorch uses a different iteration scheme from the original one introduced in the scientific literature. Is there a reason for this?

I care since I am playing with an algorithm that builds on the original momentum method and I would like to use the latter instead of PyTorch’s version. I can probably just edit the optimizer source code myself, but I was wondering about the reason behind the change.

The background is that while the two formulas are equivalent for a fixed learning rate, they differ in how a change of the learning rate (e.g. in an lr schedule) behaves. With given gradient magnitudes, when the learning rate is lowered,

in the original formula, it reduces the magnitude of the momentum updates, so the size of the parameter updates shrinks only gradually, while

in the modified formula, the momentum updates stay the same and the parameter updates become smaller immediately.

In other words, the change of learning rate can be thought of as also being applied to the existing momentum at the time of change. This turns out to be more intuitive when working with lr schedules.
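To make the difference concrete, here is a minimal scalar sketch in plain Python. It is not PyTorch's actual optimizer code: the `run` helper, the constant gradient of 1.0, the momentum of 0.9, and the 10x learning rate drop after 50 steps are all assumptions chosen for illustration.

```python
def run(scheme, lrs, grad=1.0, mu=0.9):
    """Return the parameter step taken at each iteration for a constant gradient.

    Illustrative helper only; `scheme` is either "classic" or "pytorch".
    """
    v = 0.0
    steps = []
    for lr in lrs:
        if scheme == "classic":
            v = mu * v + lr * grad   # lr is folded into the velocity
            step = v
        else:
            v = mu * v + grad        # velocity accumulates raw gradients
            step = lr * v            # lr is applied at update time
        steps.append(step)
    return steps


lrs = [0.1] * 50 + [0.01] * 5        # assumed schedule: lr drops 10x after step 50
classic = run("classic", lrs)
pytorch = run("pytorch", lrs)

print("last step before the drop:", classic[49], pytorch[49])
print("first step after the drop:", classic[50], pytorch[50])
```

Before the drop the two step sizes agree. Right after it, the step from the modified formula is roughly ten times smaller, while the step from the original formula has only started to decay.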

Relative to the wording in the documentation, I think that more recently, other frameworks have also moved to the new formula.

Best regards

Thomas

P.S.: I once sat in a talk where they described porting from Torch7 (which also applied the lr like PyTorch does) to a framework that has the other update rule, and how they spent weeks, on and off, debugging why the network would not train well with the exact same training parameters. It turned out to be the discrepancy in momentum formulas.

Thank you Thomas for the explanation. The reason does indeed make sense.

I tried to verify your claim that the two methods (for fixed learning rate) are equivalent, but it seems like this can only be achieved by rescaling the velocity for the Torch scheme:

Let p_t be a current parameter. Let lr1, u1, and v1 be learning rate, momentum, and velocity for the original scheme, and lr2, u2, and v2 the learning rate, momentum, and velocity for the PyTorch version. Let G_{t} be the gradient at time t.

The original scheme goes: p_{t+1} = p_{t} - v1_{t+1} = p_{t} - u1 v1_{t} - lr1 G_{t+1}
The PyTorch scheme goes: p_{t+1} = p_{t} - lr2 v2_{t+1} = p_{t} - lr2 u2 v2_{t} - lr2 G_{t+1}

Equating the two expressions leads to
p_{t} - u1 v1_{t} - lr1 G_{t+1} = p_{t} - lr2 u2 v2_{t} - lr2 G_{t+1},
in other words
lr1 = lr2
and
u1 v1_{t} = lr2 u2 v2_{t}.

If the velocities in the two schemes were the same, i.e. v1 = v2, the last equation becomes u1 = lr2 u2, or u2 = u1/lr2. With a legitimate choice of learning rate and u1 (for example lr2 = 0.1 and u1 = 0.9, which gives u2 = 9), this can easily lead to u2 > 1, which is forbidden. So with equal velocities the two iteration schemes cannot be equivalent.
Instead of having v1 = v2, one can take v1 = lr v2 (with lr = lr1 = lr2) and u1 = u2.

This means that the velocities in the two methods are scaled differently. Is this correct?

Yes. The “PyTorch method” applies the learning rate after computing the velocity, whereas the original Sutskever et al. method applies the learning rate before computing the velocity.
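The scaling relation v1 = lr v2 can be checked numerically with a small sketch. It runs both schemes as written in the derivation above on scalars, with a fixed learning rate and zero initial velocity. The random gradient sequence and the specific values of `lr` and `mu` are assumptions for illustration, and the same gradient is fed to both schemes at each step.

```python
import random

random.seed(0)
lr, mu = 0.1, 0.9          # assumed values; lr1 = lr2 = lr and u1 = u2 = mu
p1 = p2 = 1.0              # same starting parameter for both schemes
v1 = v2 = 0.0              # both velocities start at zero

for _ in range(20):
    g = random.uniform(-1.0, 1.0)   # arbitrary gradient, shared by both schemes

    # Original scheme: the learning rate enters the velocity.
    v1 = mu * v1 + lr * g
    p1 = p1 - v1

    # PyTorch scheme: the learning rate is applied to the velocity afterwards.
    v2 = mu * v2 + g
    p2 = p2 - lr * v2

print("parameter difference:", abs(p1 - p2))
print("velocity difference after rescaling:", abs(v1 - lr * v2))
```

Both differences should stay at the level of floating-point rounding, which matches the claim that the parameters coincide while the velocities differ by a factor of lr.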