In the docs for the SGD class, which implements stochastic gradient descent with momentum, the update rule is modified relative to the usual formulation, and there's something I'm confused about when Nesterov Accelerated Gradient (NAG) is enabled. The docs give:
PyTorch update:

$$v_{t+1} = \mu v_t + g_{t+1}$$
$$p_{t+1} = p_t - \mathrm{lr} \cdot v_{t+1}$$
Typical momentum update:

$$v_{t+1} = \mu v_t + \mathrm{lr} \cdot g_{t+1}$$
$$p_{t+1} = p_t - v_{t+1}$$
What's the reason for this difference?
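For concreteness, here's a quick plain-Python sketch I put together (toy quadratic loss, made-up names like `grad`, not actual PyTorch code). As far as I can tell, the two rules trace the same trajectory as long as lr stays constant, since PyTorch's velocity is just the classical one divided by lr:

```python
# Toy comparison of the two momentum rules (my own sketch, not PyTorch code).
# grad() is a stand-in gradient for f(p) = p**2; lr and mu match the docs' symbols.
def grad(p):
    return 2.0 * p

lr, mu = 0.1, 0.9

p_torch, v_torch = 1.0, 0.0      # PyTorch-style: lr applied at the parameter step
p_classic, v_classic = 1.0, 0.0  # classical: lr folded into the velocity

for _ in range(20):
    g = grad(p_torch)
    v_torch = mu * v_torch + g           # v_{t+1} = mu*v_t + g_{t+1}
    p_torch -= lr * v_torch              # p_{t+1} = p_t - lr*v_{t+1}

    g = grad(p_classic)
    v_classic = mu * v_classic + lr * g  # v_{t+1} = mu*v_t + lr*g_{t+1}
    p_classic -= v_classic               # p_{t+1} = p_t - v_{t+1}

# Agrees up to floating-point rounding while lr is constant:
print(p_torch, p_classic)
```

So my guess is the two forms only diverge when lr changes mid-training (e.g. with a scheduler), where the PyTorch form rescales the whole velocity immediately. Is that the motivation?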
Also, the Nesterov modification is said to follow the referenced paper (Sutskever et al., 2013), which gives this update:
$$v_{t+1} = \mu v_t - \varepsilon \nabla f(\theta_t + \mu v_t) \tag{3}$$
$$\theta_{t+1} = \theta_t + v_{t+1} \tag{4}$$
From this formulation, how do you arrive at $g_t \leftarrow g_t + \mu v_t$?
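For reference, here is my plain-Python paraphrase of how I read the step (weight decay and dampening omitted; all names are mine, not PyTorch's), which is where that substitution seems to happen:

```python
# Stripped-down sketch of one SGD-with-momentum step as I read it
# (weight decay and dampening omitted; names are mine, not PyTorch's).
def sgd_step(p, buf, g, lr, mu, nesterov):
    buf = mu * buf + g        # v_{t+1} = mu*v_t + g_{t+1}
    if nesterov:
        step = g + mu * buf   # the g <- g + mu*v substitution I'm asking about
    else:
        step = buf
    p = p - lr * step         # p_{t+1} = p_t - lr*step
    return p, buf
```

I can see that this is what the implementation does, but I don't see how it follows from equations (3) and (4).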