The docs in the source code for optim.SGD state the following:
.. note:: The implementation of SGD with Momentum/Nesterov subtly differs from Sutskever et. al. and implementations in some other frameworks. Considering the specific case of Momentum, the update can be written as .. math:: v = \rho * v + g \\ p = p - lr * v where p, g, v and :math:`\rho` denote the parameters, gradient, velocity, and momentum respectively. This is in contrast to Sutskever et. al. and other frameworks which employ an update of the form .. math:: v = \rho * v + lr * g \\ p = p - v The Nesterov version is analogously modified.
What’s the reason for this modification?