Reason for modified momentum formulation

sid · January 28, 2018, 6:18pm

The docs in the source code for optim.SGD state the following:

    .. note::
        The implementation of SGD with Momentum/Nesterov subtly differs from
        Sutskever et al. and implementations in some other frameworks.

        Considering the specific case of Momentum, the update can be written as

        .. math::
                  v = \rho * v + g \\
                  p = p - lr * v

        where p, g, v and :math:`\rho` denote the parameters, gradient,
        velocity, and momentum respectively.

        This is in contrast to Sutskever et al. and
        other frameworks which employ an update of the form

        .. math::
             v = \rho * v + lr * g \\
             p = p - v

        The Nesterov version is analogously modified.

What’s the reason for this modification?

SimonW · January 28, 2018, 6:39pm

There is some discussion and reasoning here: https://github.com/pytorch/pytorch/issues/1099
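
As a rough illustration of how the two update rules from the note relate, here is a sketch in plain Python. It is not the actual `optim.SGD` code: the function names, the gradient values, and the learning-rate schedules are made up for the example, and the velocity is assumed to start at zero. With a constant learning rate the two forms give the same parameters, since the velocities differ only by a factor of `lr`. They diverge once the learning rate changes mid-training: in the first form a new `lr` rescales the whole accumulated velocity, while in the second it only affects gradients added from that step on.

```python
def sgd_pytorch_style(p, grads, lrs, rho):
    """Form from the note: v = rho * v + g;  p = p - lr * v"""
    v = 0.0  # assumed zero initial velocity
    for g, lr in zip(grads, lrs):
        v = rho * v + g
        p = p - lr * v
    return p


def sgd_sutskever_style(p, grads, lrs, rho):
    """Other form from the note: v = rho * v + lr * g;  p = p - v"""
    v = 0.0  # assumed zero initial velocity
    for g, lr in zip(grads, lrs):
        v = rho * v + lr * g
        p = p - v
    return p


# Hypothetical inputs, chosen only for illustration.
grads = [1.0, 0.5, -0.25, 2.0]
rho = 0.9
constant_lrs = [0.1, 0.1, 0.1, 0.1]
decayed_lrs = [0.1, 0.1, 0.01, 0.01]  # lr drops by 10x after two steps

# Constant lr: both forms end at the same parameter value.
const_a = sgd_pytorch_style(1.0, grads, constant_lrs, rho)
const_b = sgd_sutskever_style(1.0, grads, constant_lrs, rho)

# Changing lr: the results differ, because the second form keeps
# stepping with velocity that was accumulated under the old lr.
decay_a = sgd_pytorch_style(1.0, grads, decayed_lrs, rho)
decay_b = sgd_sutskever_style(1.0, grads, decayed_lrs, rho)

print(const_a, const_b)
print(decay_a, decay_b)
```

So if you never change the learning rate, the choice between the two is only a rescaling of the velocity buffer; the difference matters when an lr schedule is used.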