Hi All,
It seems that PyTorch's implementation of SGD with Nesterov momentum differs from:
x_ahead = x + mu*v
v = mu*v - learning_rate * dx_ahead
x = x + v
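For reference, here is a minimal runnable sketch of that "lookahead" formulation in plain Python. The helper name `nesterov_step` and the toy objective are made up for illustration; this is not PyTorch's internal code.

```python
# Sketch of the lookahead Nesterov update described above (illustrative only):
#   x_ahead = x + mu*v
#   v       = mu*v - learning_rate * dx_ahead
#   x       = x + v

def nesterov_step(x, v, grad_fn, lr=0.1, mu=0.9):
    """One step: evaluate the gradient at the lookahead point x + mu*v,
    then update the velocity and the parameter."""
    x_ahead = x + mu * v                 # lookahead position
    v = mu * v - lr * grad_fn(x_ahead)   # velocity update using dx_ahead
    x = x + v                            # parameter update
    return x, v

# Toy example: minimize f(x) = x^2, whose gradient is 2*x.
x, v = 5.0, 0.0
for _ in range(100):
    x, v = nesterov_step(x, v, lambda z: 2 * z)
print(abs(x) < 1e-3)
```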
Does PyTorch's implementation give the same answer as the one above?
Also, are PyTorch's implementations of SGD+Momentum and SGD+Nesterov non-adaptive learning-rate algorithms?
Thanks!