The documentation remarks that this differs from the original formula used by Sutskever et al. in terms of the application of the learning rate. However, it also differs in that PyTorch subtracts the velocity from the parameter instead of adding it.

The two formulations are equivalent. Let’s call the “velocity” in the
first, PyTorch, formulation vPytorch_{t} and in your second proposed
version vPhysics_{t}. The two formulations differ only in a redefinition
of v, namely vPhysics_{t} = -vPytorch_{t}, that drops out of the
final calculation of p_{t}.

Now you might prefer to call the thing that you add to your “position”, p_{t}, a “velocity,” so you might prefer the second, vPhysics,
formulation. But that is purely a semantic or stylistic choice; again,
the two formulations are mathematically equivalent.
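
If it helps, here is a small numerical sketch of the sign-flip argument. It rests on assumptions of mine: I take the PyTorch-style update to be v = mu * v + g followed by p = p - lr * v, and I take your second version to be v = mu * v - g followed by p = p + lr * v. The function names, the toy loss f(p) = p^2, and the hyperparameter values are all made up for illustration and are not taken from PyTorch.

```python
def grad(p):
    # Gradient of the illustrative loss f(p) = p**2 (an assumption, not from the thread).
    return 2.0 * p


def run_pytorch_style(p0, lr=0.1, mu=0.9, steps=20):
    # Velocity accumulates +gradient and is subtracted from the parameter.
    p, v = p0, 0.0
    history = []
    for _ in range(steps):
        g = grad(p)
        v = mu * v + g
        p = p - lr * v
        history.append((p, v))
    return history


def run_physics_style(p0, lr=0.1, mu=0.9, steps=20):
    # Velocity accumulates -gradient and is added to the parameter.
    p, v = p0, 0.0
    history = []
    for _ in range(steps):
        g = grad(p)
        v = mu * v - g
        p = p + lr * v
        history.append((p, v))
    return history


pytorch_history = run_pytorch_style(p0=1.0)
physics_history = run_physics_style(p0=1.0)

# Same parameters at every step; velocities differ only by sign.
max_p_gap = max(abs(a[0] - b[0]) for a, b in zip(pytorch_history, physics_history))
max_v_gap = max(abs(a[1] + b[1]) for a, b in zip(pytorch_history, physics_history))
print("largest parameter difference:", max_p_gap)
print("largest |vPytorch + vPhysics|:", max_v_gap)
```

Both printed gaps should be zero up to floating-point rounding, which is just the statement vPhysics_{t} = -vPytorch_{t} with identical p_{t}.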