Loss calculation of SGD with Nesterov

gilbertocunha · February 3, 2021, 2:44pm

So from my knowledge nesterov momentum should look something like this:

According to this formula, the loss function should be calculated not according to our model’s parameters, but with a prediction of the future model’s parameters.

And using a torch SGD optimizer with Nesterov should look like the following:

optimizer = torch.optim.SGD(..., nesterov=True)
optimizer.zero_grad()
loss_fn(model(input), target).backward()
optimizer.step()

My question is how can optimizer.step() apply the correct parameter update (in the Nesterov case) if the loss is calculated according to the model’s present parameters and not “future” parameter predictions as in the equation above? I already looked at the source code and I still do not understand how this was done.

Does PyTorch call the forward method of the model and somehow overwrites its parameters in model(input)? If that is the case, how is that done?

albanD · February 3, 2021, 4:24pm

Hi,

I am not 100% sure but I think the trick is to have the parameters actually contain \theta_t - \mu v_t. That way a regular forward/backawrd evaluates the right value and you only need to update the step formula to take this into account.

gilbertocunha · February 3, 2021, 5:43pm

Thanks for the response

I guess it does make some sense that way, but those parameters would have to be changed before the model(input) instruction, or while that instruction is being evaluated, I guess.

Also not sure if the parameters should not be reverted back before calling backward(), though, would that not have an impact on the gradient calculation?

albanD · February 3, 2021, 7:28pm

You compute the gradient wrt to the weights, you added just a constant value to them, so no it wouldn’t impact the gradient computation.

gilbertocunha · February 3, 2021, 7:42pm

good point, silly me ahaha

Thank you so much for your help!
Even if this isn’t the way it is implementated in the source code it certainly seems like an implementable solution!

ado_sar · February 22, 2025, 9:36pm

Can you elaborate how PyTorch calculates the gradient for future model’s parameters?

By reading the pseudocode:

I don’t see how PyTorch achieve this.