PyTorch Nesterov Implementation

Hi All,

It seems that PyTorch’s implementation of SGD with Nesterov momentum differs from the following update:

x_ahead = x + mu*v
v = learning_rate * (mu*v - dx_ahead)
x = x + v

Does PyTorch’s implementation give the same answer as the one above?
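One way to probe this numerically is the small sketch below. It is pure Python and does not call torch. It makes two assumptions: the "look-ahead" rule is taken in its common form, v = mu*v - learning_rate*dx_ahead (which places learning_rate differently from the update written above), and the second function is a hand-written version of the Nesterov rule described in the torch.optim.SGD documentation with dampening=0 and no weight decay. The function names and the toy objective f(x) = 0.5*x^2 are made up for illustration. Under these assumptions and a constant learning rate, the parameter stored by the second rule appears to track the look-ahead point x_ahead of the first rule rather than x itself.

```python
# Toy objective f(x) = 0.5 * x**2, so the gradient is simply x.
# Both update rules below are hand-written assumptions; neither calls torch.

def grad(x):
    return x

def lookahead_nesterov(x0, lr, mu, steps):
    """Common look-ahead form: v = mu*v - lr*grad(x + mu*v); x = x + v."""
    x, v = x0, 0.0
    lookaheads = []
    for _ in range(steps):
        x_ahead = x + mu * v          # point where the gradient is evaluated
        lookaheads.append(x_ahead)
        v = mu * v - lr * grad(x_ahead)
        x = x + v
    return lookaheads

def documented_sgd_nesterov(p0, lr, mu, steps):
    """Rule as described in the torch.optim.SGD docs (dampening=0, assumed)."""
    p, buf = p0, 0.0
    params = []
    for _ in range(steps):
        params.append(p)              # parameter before this step's update
        g = grad(p)
        buf = mu * buf + g            # momentum buffer accumulates raw gradients
        p = p - lr * (g + mu * buf)   # Nesterov correction, then scale by lr
    return params

a = lookahead_nesterov(1.0, lr=0.1, mu=0.9, steps=20)
b = documented_sgd_nesterov(1.0, lr=0.1, mu=0.9, steps=20)

# Largest difference between the look-ahead points and the stored parameters.
max_gap = max(abs(u - w) for u, w in zip(a, b))
print(max_gap)  # expected to be at floating-point round-off level
```

If the learning rate changes between steps, the two sequences need not coincide, because the first rule folds the learning rate into v while the second applies it only at the final update.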

Also, are PyTorch’s implementations of SGD+Momentum and SGD+Nesterov non-adaptive learning rate algorithms?