This is more of a conceptual question since I recently learned about this optimization algorithm. All of the code below is just pseudocode (not actual python). We know that the classical momentum update is given by
g(t) = mu * g(t-1) + grad loss(theta(t-1))
theta(t) = theta(t-1) - alpha * g(t)
where mu is the momentum parameter and alpha is the learning rate.
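In actual Python this time, one step of the classical momentum update might look like this (the names `grad_loss`, `mu`, and `alpha` are just placeholders matching the notation above, not any library's API):

```python
def momentum_step(theta, g, grad_loss, mu=0.9, alpha=0.1):
    """One classical momentum step:
    g(t)     = mu * g(t-1) + grad loss(theta(t-1))
    theta(t) = theta(t-1) - alpha * g(t)
    """
    g_new = mu * g + grad_loss(theta)      # accumulate velocity
    theta_new = theta - alpha * g_new      # move against the velocity
    return theta_new, g_new
```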
The Nesterov Accelerated Gradient update is given by
g(t) = mu * g(t-1) + grad loss(theta(t-1) - mu * alpha * g(t-1))
theta(t) = theta(t-1) - alpha * g(t)
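The only change from the momentum step is where the gradient is evaluated: at the "look-ahead" point theta(t-1) - mu * alpha * g(t-1). A sketch with the same placeholder names:

```python
def nag_step(theta, g, grad_loss, mu=0.9, alpha=0.1):
    """One Nesterov accelerated gradient step:
    g(t)     = mu * g(t-1) + grad loss(theta(t-1) - mu * alpha * g(t-1))
    theta(t) = theta(t-1) - alpha * g(t)
    """
    lookahead = theta - mu * alpha * g     # peek ahead along the velocity
    g_new = mu * g + grad_loss(lookahead)  # gradient at the look-ahead point
    theta_new = theta - alpha * g_new
    return theta_new, g_new
```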
To make things somewhat simpler, we can define
theta_new(t) = theta(t) - alpha * mu * g(t)
Then, the update equations above can be expressed as
g(t) = mu * g(t-1) + grad loss(theta_new(t-1))
theta(t) = theta(t-1) - alpha * g(t)
theta_new(t) = theta(t) - alpha * mu * g(t)
= theta(t-1) - alpha * g(t) - alpha * mu * g(t)
= theta_new(t-1) + alpha * mu * g(t-1) - alpha * g(t) - alpha * mu * g(t)
= theta_new(t-1) - alpha * (g(t) - mu * g(t-1) + mu * g(t))
= theta_new(t-1) - alpha * (grad loss(theta_new(t-1)) + mu * g(t))
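To sanity-check this algebra numerically, here is a small plain-Python script (using loss(x) = x^2 as an assumed toy example) that runs the original NAG recursion on theta and the reformulated recursion on theta_new in lockstep, and measures how badly the invariant theta_new(t) = theta(t) - alpha * mu * g(t) is violated:

```python
def check_reformulation(grad_loss, theta0=1.0, mu=0.9, alpha=0.1, steps=20):
    """Return the largest observed violation of
    theta_new(t) == theta(t) - alpha * mu * g(t) over `steps` iterations."""
    theta, g = theta0, 0.0
    theta_new = theta - alpha * mu * g     # theta_new(0)
    worst = 0.0
    for _ in range(steps):
        grad = grad_loss(theta_new)        # grad loss(theta_new(t-1)), i.e. the look-ahead gradient
        g = mu * g + grad                  # g(t)
        theta = theta - alpha * g          # theta(t), original NAG update
        theta_new = theta_new - alpha * (grad + mu * g)  # reformulated update
        worst = max(worst, abs(theta_new - (theta - alpha * mu * g)))
    return worst
```

On the quadratic toy loss the two recursions agree up to floating-point error, which is what the derivation above predicts.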
I apologize if this notation is not standard: it’s just what we used in my deep learning class.
Unless I am mistaken, it appears that the PyTorch implementation of SGD (documentation here: SGD — PyTorch 1.10 documentation) applies the update for theta_new directly to the model parameters theta, without explicitly defining the relationship between the two. I know that the PyTorch implementation works, so I am wondering where the disconnect is between the derivation above and what is present in the source code. Thank you!
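For concreteness, here is my plain-Python reading of the Nesterov branch of the pseudocode in the linked documentation (the names `b`, `d`, and `grad` are mine, not PyTorch's, and this is only a sketch of how I understand the docs, not the actual implementation):

```python
def pytorch_style_nesterov_step(theta, b, grad, mu=0.9, lr=0.1):
    """My reading of the documented update with nesterov=True:
    b(t) = mu * b(t-1) + grad        (momentum buffer)
    d(t) = grad + mu * b(t)          (nesterov direction)
    theta(t) = theta(t-1) - lr * d(t)
    """
    b = mu * b + grad
    d = grad + mu * b
    return theta - lr * d, b
```

This looks term-for-term like the theta_new recursion above, which is exactly what prompted my question: the gradient is taken at the stored parameters, and there is no separate theta variable anywhere.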