Difference in implementation of SGD with Momentum/Nesterov

Sam-gege · October 6, 2021, 1:09pm

The optim.SGD docs say: “The implementation of SGD with Momentum/Nesterov subtly differs from Sutskever et. al. and implementations in some other frameworks.” Does anyone know why? The other sources I’ve read all implement it differently from PyTorch. Thanks!
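
For context, the same note in the docs goes on to contrast the two update rules for plain momentum: PyTorch accumulates the raw gradient in the velocity and applies the learning rate when updating the parameter, while the Sutskever-style form folds the learning rate into the velocity. Below is a minimal sketch of that difference in plain Python, not PyTorch's actual code. It assumes no dampening, weight decay, or Nesterov, and the toy loss, the function names, and the learning-rate schedules are made up for illustration.

```python
def grad(p):
    # Gradient of the toy loss f(p) = 0.5 * p**2 (an assumption for illustration).
    return p


def run_pytorch_style(p, lrs, mu=0.9):
    # v <- mu * v + g ;  p <- p - lr * v   (form described in the optim.SGD note)
    v = 0.0
    for lr in lrs:
        v = mu * v + grad(p)
        p = p - lr * v
    return p


def run_sutskever_style(p, lrs, mu=0.9):
    # v <- mu * v + lr * g ;  p <- p - v   (form the note attributes to Sutskever et al.)
    v = 0.0
    for lr in lrs:
        v = mu * v + lr * grad(p)
        p = p - v
    return p


# Hypothetical schedules: one constant, one that drops the learning rate mid-run.
constant = [0.1] * 5
decayed = [0.1, 0.1, 0.1, 0.01, 0.01]

# Gap between the two variants' final parameters under each schedule.
same = abs(run_pytorch_style(1.0, constant) - run_sutskever_style(1.0, constant))
diff = abs(run_pytorch_style(1.0, decayed) - run_sutskever_style(1.0, decayed))
print(same, diff)
```

With a constant learning rate the two forms should trace the same parameter trajectory, since the velocities only differ by a factor of lr. They diverge once the learning rate changes during training, because the Sutskever-style velocity still carries the old learning rate.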