This is more of a conceptual question since I recently learned about this optimization algorithm. All of the code below is just pseudocode (not actual Python). We know that the classical momentum update is given by

```
g(t) = mu * g(t-1) + grad loss(theta(t-1))
theta(t) = theta(t-1) - alpha * g(t)
```

where `mu` is the momentum parameter and `alpha` is the learning rate.
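
To make the pseudocode concrete, here is the same update as runnable Python on a 1-D quadratic loss `0.5 * theta**2` (the loss, initial values, and hyperparameters are illustrative assumptions on my part, not part of the question):

```python
# Classical momentum on an assumed 1-D quadratic loss 0.5 * theta**2,
# whose gradient is simply theta. Names mirror the pseudocode above.

def grad_loss(theta):
    return theta  # gradient of 0.5 * theta**2

mu, alpha = 0.9, 0.1   # momentum and learning rate (illustrative values)
theta, g = 5.0, 0.0    # initial parameter and velocity

for _ in range(200):
    g = mu * g + grad_loss(theta)   # g(t) = mu * g(t-1) + grad loss(theta(t-1))
    theta = theta - alpha * g       # theta(t) = theta(t-1) - alpha * g(t)

print(theta)  # converges toward the minimum at 0
```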

The Nesterov Accelerated Gradient update is given by

```
g(t) = mu * g(t-1) + grad loss(theta(t-1) - mu * alpha * g(t-1))
theta(t) = theta(t-1) - alpha * g(t)
```
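
The only change from classical momentum is where the gradient is evaluated: at the "look-ahead" point `theta - mu * alpha * g` rather than at `theta` itself. A runnable sketch on the same assumed 1-D quadratic loss:

```python
# NAG on an assumed 1-D quadratic loss 0.5 * theta**2.
# The gradient is taken at the look-ahead point, not the current theta.

def grad_loss(theta):
    return theta  # gradient of 0.5 * theta**2

mu, alpha = 0.9, 0.1   # momentum and learning rate (illustrative values)
theta, g = 5.0, 0.0

for _ in range(200):
    lookahead = theta - mu * alpha * g     # theta(t-1) - mu * alpha * g(t-1)
    g = mu * g + grad_loss(lookahead)      # g(t)
    theta = theta - alpha * g              # theta(t)

print(theta)  # converges toward the minimum at 0
```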

To make things somewhat simpler, we can define

```
theta_new(t) = theta(t) - alpha * mu * g(t)
```

Then, the update equations above can be expressed as

```
g(t) = mu * g(t-1) + grad loss(theta_new(t-1))
theta(t) = theta(t-1) - alpha * g(t)
theta_new(t) = theta(t) - alpha * mu * g(t)
             = theta(t-1) - alpha * g(t) - alpha * mu * g(t)
             = theta_new(t-1) + alpha * mu * g(t-1) - alpha * g(t) - alpha * mu * g(t)
             = theta_new(t-1) - alpha * (g(t) - mu * g(t-1) + mu * g(t))
             = theta_new(t-1) - alpha * (grad loss(theta_new(t-1)) + mu * g(t))
```

(Note that substituting the definition of `theta_new` turns the look-ahead point `theta(t-1) - mu * alpha * g(t-1)` into `theta_new(t-1)`, which is why the gradient in the first line is evaluated at `theta_new(t-1)`; the last step then uses `g(t) - mu * g(t-1) = grad loss(theta_new(t-1))`.)
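
This equivalence can be checked numerically: iterating the original NAG update on `theta` and converting at the end via `theta_new = theta - alpha * mu * g` should agree with iterating the derived `theta_new` recursion directly. A sketch, again assuming an illustrative 1-D quadratic loss (note `theta_new(0) = theta(0)` because `g(0) = 0`):

```python
def grad_loss(x):
    return x  # gradient of the assumed loss 0.5 * x**2

mu, alpha = 0.9, 0.1

# Path 1: direct NAG on theta, then convert to theta_new at the end.
theta, g = 5.0, 0.0
for _ in range(50):
    g = mu * g + grad_loss(theta - mu * alpha * g)
    theta = theta - alpha * g
theta_new_from_theta = theta - alpha * mu * g

# Path 2: the derived recursion on theta_new alone.
theta_new, g2 = 5.0, 0.0
for _ in range(50):
    gr = grad_loss(theta_new)                      # grad loss(theta_new(t-1))
    g2 = mu * g2 + gr                              # g(t)
    theta_new = theta_new - alpha * (gr + mu * g2) # theta_new(t)

print(abs(theta_new - theta_new_from_theta))  # agrees up to float rounding
```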

I apologize if this notation is not standard: it’s just what we used in my deep learning class.

Unless I am mistaken, it appears that the PyTorch implementation of SGD (documentation here: SGD — PyTorch 1.10 documentation) applies the update for `theta_new` directly to the model parameters `theta`, without explicitly defining the relationship between the two. I know that the PyTorch implementation works, so I am wondering where the disconnect is between the derivation above and what is present in the source code. Thank you!
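
For reference, here is a plain-Python sketch of the update style I am describing — the `theta_new`-shaped step applied directly to the stored parameters, as I read the pseudocode in the SGD documentation (`b` is the momentum buffer; the quadratic loss and hyperparameter values are my illustrative assumptions):

```python
def grad_loss(p):
    return p  # gradient of the assumed loss 0.5 * p**2

lr, mu = 0.1, 0.9   # learning rate and momentum (illustrative values)
p, b = 5.0, 0.0     # stored parameter and momentum buffer

for _ in range(200):
    d_p = grad_loss(p)
    b = mu * b + d_p     # buffer update: b(t) = mu * b(t-1) + grad
    d_p = d_p + mu * b   # Nesterov adjustment
    p = p - lr * d_p     # p(t) = p(t-1) - lr * (grad + mu * b(t))

print(p)  # converges toward the minimum at 0
```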