This is more of a conceptual question since I recently learned about this optimization algorithm. All of the code below is just pseudocode (not actual python). We know that the classical momentum update is given by
g(t) = mu * g(t-1) + grad loss(theta(t-1))
theta(t) = theta(t-1) - alpha * g(t)
where mu is the momentum parameter and alpha is the learning rate.
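In actual Python this time, one step of the classical momentum update might look like this (the names `grad_loss`, `mu`, and `alpha` are just placeholders matching the notation above, not any library's API):

```python
def momentum_step(theta, g, grad_loss, mu=0.9, alpha=0.1):
    """One classical momentum step:
    g(t)     = mu * g(t-1) + grad loss(theta(t-1))
    theta(t) = theta(t-1) - alpha * g(t)
    """
    g_new = mu * g + grad_loss(theta)      # accumulate velocity
    theta_new = theta - alpha * g_new      # move against the velocity
    return theta_new, g_new
```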
The Nesterov Accelerated Gradient update is given by
g(t) = mu * g(t-1) + grad loss(theta(t-1) - mu * alpha * g(t-1))
theta(t) = theta(t-1) - alpha * g(t)
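The only change from the momentum step is where the gradient is evaluated: at the "look-ahead" point theta(t-1) - mu * alpha * g(t-1). A sketch with the same placeholder names:

```python
def nag_step(theta, g, grad_loss, mu=0.9, alpha=0.1):
    """One Nesterov accelerated gradient step:
    g(t)     = mu * g(t-1) + grad loss(theta(t-1) - mu * alpha * g(t-1))
    theta(t) = theta(t-1) - alpha * g(t)
    """
    lookahead = theta - mu * alpha * g     # peek ahead along the velocity
    g_new = mu * g + grad_loss(lookahead)  # gradient at the look-ahead point
    theta_new = theta - alpha * g_new
    return theta_new, g_new
```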
To make things somewhat simpler, we can define
theta_new(t) = theta(t) - alpha * mu * g(t)
Then, the update equations above can be expressed as
g(t) = mu * g(t-1) + grad loss(theta_new(t-1))
theta(t) = theta(t-1) - alpha * g(t)
theta_new(t) = theta(t) - alpha * mu * g(t)
= theta(t-1) - alpha * g(t) - alpha * mu * g(t)
= theta_new(t-1) + alpha * mu * g(t-1) - alpha * g(t) - alpha * mu * g(t)
= theta_new(t-1) - alpha * (g(t) - mu * g(t-1) + mu * g(t))
= theta_new(t-1) - alpha * (grad loss(theta_new(t-1)) + mu * g(t))
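To sanity-check this algebra numerically, here is a small plain-Python script (using loss(x) = x^2 as an assumed toy example) that runs the original NAG recursion on theta and the reformulated recursion on theta_new in lockstep, and measures how badly the invariant theta_new(t) = theta(t) - alpha * mu * g(t) is violated:

```python
def check_reformulation(grad_loss, theta0=1.0, mu=0.9, alpha=0.1, steps=20):
    """Return the largest observed violation of
    theta_new(t) == theta(t) - alpha * mu * g(t) over `steps` iterations."""
    theta, g = theta0, 0.0
    theta_new = theta - alpha * mu * g     # theta_new(0)
    worst = 0.0
    for _ in range(steps):
        grad = grad_loss(theta_new)        # grad loss(theta_new(t-1)), i.e. the look-ahead gradient
        g = mu * g + grad                  # g(t)
        theta = theta - alpha * g          # theta(t), original NAG update
        theta_new = theta_new - alpha * (grad + mu * g)  # reformulated update
        worst = max(worst, abs(theta_new - (theta - alpha * mu * g)))
    return worst
```

On the quadratic toy loss the two recursions agree up to floating-point error, which is what the derivation above predicts.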
I apologize if this notation is not standard: it’s just what we used in my deep learning class.
Unless I am mistaken, it appears that the PyTorch implementation of SGD (documentation here: SGD — PyTorch 1.10 documentation) applies the update for theta_new directly to the model parameters theta, without explicitly defining the relationship between the two. I know that the PyTorch implementation works, so I am wondering where the disconnect is between the derivation above and what is present in the source code. Thank you!
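For concreteness, here is my plain-Python reading of the Nesterov branch of the pseudocode in the linked documentation (the names `b`, `d`, and `grad` are mine, not PyTorch's, and this is only a sketch of how I understand the docs, not the actual implementation):

```python
def pytorch_style_nesterov_step(theta, b, grad, mu=0.9, lr=0.1):
    """My reading of the documented update with nesterov=True:
    b(t) = mu * b(t-1) + grad        (momentum buffer)
    d(t) = grad + mu * b(t)          (nesterov direction)
    theta(t) = theta(t-1) - lr * d(t)
    """
    b = mu * b + grad
    d = grad + mu * b
    return theta - lr * d, b
```

This looks term-for-term like the theta_new recursion above, which is exactly what prompted my question: the gradient is taken at the stored parameters, and there is no separate theta variable anywhere.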