This is more of a conceptual question, since I recently learned about this optimization algorithm. All of the code below is pseudocode, not actual Python. We know that the classical momentum update is given by

```
    g(t) = mu * g(t-1) + grad loss(theta(t-1))
theta(t) = theta(t-1) - alpha * g(t)
```
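As a quick sanity check, the classical momentum update can be written as runnable Python. The quadratic loss `0.5 * theta**2` (whose gradient is just `theta`) and the values of `mu` and `alpha` are illustrative choices, not part of the question:

```python
# Classical momentum on the toy 1-D loss(theta) = 0.5 * theta**2,
# whose gradient is simply theta. mu and alpha are illustrative values.

def momentum_step(theta, g, mu=0.9, alpha=0.1):
    grad = theta                  # grad loss(theta(t-1)) for this toy loss
    g = mu * g + grad             # g(t) = mu * g(t-1) + grad loss(theta(t-1))
    theta = theta - alpha * g     # theta(t) = theta(t-1) - alpha * g(t)
    return theta, g

theta, g = 1.0, 0.0
for _ in range(200):
    theta, g = momentum_step(theta, g)
# theta ends up close to the minimizer at 0
```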

where `mu` is the momentum parameter and `alpha` is the learning rate.
The Nesterov Accelerated Gradient update is given by

```
    g(t) = mu * g(t-1) + grad loss(theta(t-1) - mu * alpha * g(t-1))
theta(t) = theta(t-1) - alpha * g(t)
```
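The same toy setup works for NAG; the only change is that the gradient is evaluated at the "lookahead" point `theta(t-1) - mu * alpha * g(t-1)`. The loss and hyperparameters are illustrative, as before:

```python
# NAG on the toy 1-D loss(theta) = 0.5 * theta**2, gradient = theta.
# mu and alpha are illustrative values.

def nag_step(theta, g, mu=0.9, alpha=0.1):
    lookahead = theta - mu * alpha * g   # where NAG evaluates the gradient
    grad = lookahead                     # grad of 0.5 * x**2 at the lookahead
    g = mu * g + grad                    # g(t)
    theta = theta - alpha * g            # theta(t) = theta(t-1) - alpha * g(t)
    return theta, g

theta, g = 1.0, 0.0
for _ in range(200):
    theta, g = nag_step(theta, g)
# theta ends up close to the minimizer at 0
```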

To make things somewhat simpler, we can define

```
theta_new(t) = theta(t) - alpha * mu * g(t)
```

Then, the update equations above can be expressed as

```
         g(t) = mu * g(t-1) + grad loss(theta_new(t-1))
     theta(t) = theta(t-1) - alpha * g(t)
 theta_new(t) = theta(t) - alpha * mu * g(t)
              = theta(t-1) - alpha * g(t) - alpha * mu * g(t)
              = theta_new(t-1) + alpha * mu * g(t-1) - alpha * g(t) - alpha * mu * g(t)
              = theta_new(t-1) - alpha * (g(t) - mu * g(t-1) + mu * g(t))
              = theta_new(t-1) - alpha * (grad loss(theta_new(t-1)) + mu * g(t))
```

(Note that the lookahead point `theta(t-1) - mu * alpha * g(t-1)` is exactly `theta_new(t-1)`, which is why the gradient in the first line is evaluated there.)
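The last line can be checked numerically. Running the original NAG recursion for `theta` and the rewritten recursion for `theta_new` side by side (again on the illustrative loss `0.5 * theta**2`), the invariant `theta_new(t) = theta(t) - alpha * mu * g(t)` holds at every step:

```python
# Side-by-side check that the theta_new recursion reproduces
# theta(t) - alpha * mu * g(t). Loss and hyperparameters are illustrative.

mu, alpha = 0.9, 0.1
grad = lambda x: x                     # grad of 0.5 * x**2

theta, g = 1.0, 0.0
theta_new = theta - alpha * mu * g     # theta_new(0)

for _ in range(50):
    d = grad(theta_new)                # grad loss(theta_new(t-1)); the NAG
                                       # lookahead point equals theta_new(t-1)
    g = mu * g + d                     # g(t)
    theta = theta - alpha * g          # original NAG update of theta
    theta_new = theta_new - alpha * (d + mu * g)   # rewritten recursion
    assert abs(theta_new - (theta - alpha * mu * g)) < 1e-12
```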

I apologize if this notation is not standard: it’s just what we used in my deep learning class.

Unless I am mistaken, the PyTorch implementation of SGD (documentation here: SGD — PyTorch 1.10 documentation) applies the update for `theta_new` directly to the model parameters `theta`, without ever explicitly relating the two. I know that the PyTorch implementation works, so I am wondering where the disconnect is between the derivation above and what is present in the source code. Thank you!
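For concreteness, here is a plain-Python sketch of the per-parameter update that the SGD docs describe for `nesterov=True`, applied to the same toy loss `0.5 * p**2` (the name `buf` follows the momentum-buffer name in the docs' pseudocode; the toy gradient is mine). The parameter `p` plays the role of `theta_new` above: the gradient is taken at `p` itself, and the step is `alpha * (grad + mu * g(t))`, which is exactly the final line of the derivation.

```python
# Sketch of the nesterov=True branch of torch.optim.SGD, following the
# pseudocode in its documentation, on the toy loss 0.5 * p**2.

def sgd_nesterov_step(p, buf, lr=0.1, momentum=0.9):
    d_p = p                          # grad loss(p) for the toy loss, at p itself
    buf = momentum * buf + d_p       # buf(t) = mu * buf(t-1) + grad
    step = d_p + momentum * buf      # grad + mu * buf(t): the theta_new step
    p = p - lr * step                # p <- p - alpha * (grad loss(p) + mu * g(t))
    return p, buf

p, buf = 1.0, 0.0
for _ in range(200):
    p, buf = sgd_nesterov_step(p, buf)
# p ends up close to the minimizer at 0
```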

I know I am a month late but bumping this!