In the docs for the SGD class, which implements stochastic gradient descent with momentum, the update rule is modified relative to the usual formulation, and there's something I'm confused about when Nesterov Accelerated Gradient (NAG) is enabled. The docs give:
PyTorch update:

$$v_{t+1} = \mu v_t + g_{t+1}$$
$$p_{t+1} = p_t - \mathrm{lr} \cdot v_{t+1}$$
Typical momentum update:

$$v_{t+1} = \mu v_t + \mathrm{lr} \cdot g_{t+1}$$
$$p_{t+1} = p_t - v_{t+1}$$
What's the reason for this difference?
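For concreteness, here's a quick plain-Python sketch I put together (toy quadratic loss, made-up names like `grad`, not actual PyTorch code). As far as I can tell, the two rules trace the same trajectory as long as lr stays constant, since PyTorch's velocity is just the classical one divided by lr:

```python
# Toy comparison of the two momentum rules (my own sketch, not PyTorch code).
# grad() is a stand-in gradient for f(p) = p**2; lr and mu match the docs' symbols.
def grad(p):
    return 2.0 * p

lr, mu = 0.1, 0.9

p_torch, v_torch = 1.0, 0.0      # PyTorch-style: lr applied at the parameter step
p_classic, v_classic = 1.0, 0.0  # classical: lr folded into the velocity

for _ in range(20):
    g = grad(p_torch)
    v_torch = mu * v_torch + g           # v_{t+1} = mu*v_t + g_{t+1}
    p_torch -= lr * v_torch              # p_{t+1} = p_t - lr*v_{t+1}

    g = grad(p_classic)
    v_classic = mu * v_classic + lr * g  # v_{t+1} = mu*v_t + lr*g_{t+1}
    p_classic -= v_classic               # p_{t+1} = p_t - v_{t+1}

# Agrees up to floating-point rounding while lr is constant:
print(p_torch, p_classic)
```

So my guess is the two forms only diverge when lr changes mid-training (e.g. with a scheduler), where the PyTorch form rescales the whole velocity immediately. Is that the motivation?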
Also, the Nesterov modification is said to follow the referenced paper (Sutskever et al., 2013), which gives this update:
$$v_{t+1} = \mu v_t - \varepsilon \nabla f(\theta_t + \mu v_t) \tag{3}$$
$$\theta_{t+1} = \theta_t + v_{t+1} \tag{4}$$
From this formulation, how do you arrive at $g_t \leftarrow g_t + \mu v_t$?
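For reference, here is my plain-Python paraphrase of how I read the step (weight decay and dampening omitted; all names are mine, not PyTorch's), which is where that substitution seems to happen:

```python
# Stripped-down sketch of one SGD-with-momentum step as I read it
# (weight decay and dampening omitted; names are mine, not PyTorch's).
def sgd_step(p, buf, g, lr, mu, nesterov):
    buf = mu * buf + g        # v_{t+1} = mu*v_t + g_{t+1}
    if nesterov:
        step = g + mu * buf   # the g <- g + mu*v substitution I'm asking about
    else:
        step = buf
    p = p - lr * step         # p_{t+1} = p_t - lr*step
    return p, buf
```

I can see that this is what the implementation does, but I don't see how it follows from equations (3) and (4).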