# Reason for modified momentum formulation

The docs in the source code for optim.SGD state the following:

    .. note::
        The implementation of SGD with Momentum/Nesterov subtly differs from
        Sutskever et. al. and implementations in some other frameworks.

        Considering the specific case of Momentum, the update can be written as

        .. math::
            v = \rho * v + g \\
            p = p - lr * v

        where p, g, v and :math:`\rho` denote the parameters, gradient,
        velocity, and momentum respectively.

        This is in contrast to Sutskever et. al. and
        other frameworks which employ an update of the form

        .. math::
            v = \rho * v + lr * g \\
            p = p - v

        The Nesterov version is analogously modified.


What’s the reason for this modification?

There is some discussion of the reasoning here: https://github.com/pytorch/pytorch/issues/1099
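
To make the difference between the two quoted update rules concrete, here is a minimal sketch in plain Python (scalars only, no PyTorch). The function names `pytorch_style` and `sutskever_style`, the gradient values, and the learning-rate schedules are all made up for illustration; this is not the actual `optim.SGD` code, just the two formulas from the note applied step by step.

```python
def pytorch_style(grads, lrs, rho=0.9, p=0.0):
    """Form from the note: v = rho * v + g ; p = p - lr * v."""
    v = 0.0
    for g, lr in zip(grads, lrs):
        v = rho * v + g
        p = p - lr * v
    return p


def sutskever_style(grads, lrs, rho=0.9, p=0.0):
    """Other form: v = rho * v + lr * g ; p = p - v."""
    v = 0.0
    for g, lr in zip(grads, lrs):
        v = rho * v + lr * g
        p = p - v
    return p


# Hypothetical gradients, chosen only for illustration.
grads = [1.0, 1.0, 1.0, 1.0]

# Constant learning rate: both rules give the same parameter value
# (up to floating-point rounding), since one velocity is lr times the other.
constant = [0.1, 0.1, 0.1, 0.1]
same_a = pytorch_style(grads, constant)
same_b = sutskever_style(grads, constant)

# Learning rate dropped after two steps: the results now differ.
decayed = [0.1, 0.1, 0.01, 0.01]
diff_a = pytorch_style(grads, decayed)
diff_b = sutskever_style(grads, decayed)

print(same_a, same_b)
print(diff_a, diff_b)
```

With a constant learning rate the two forms are equivalent. Once the learning rate changes, they are not: in the first form `lr` multiplies the whole velocity, so a smaller learning rate shrinks the very next step, whereas in the second form the velocity still carries earlier gradients scaled by the old learning rate.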