Difference in implementation of SGD with Momentum/Nesterov

The optim.SGD docs say: “The implementation of SGD with Momentum/Nesterov subtly differs from Sutskever et. al. and implementations in some other frameworks.” Does anyone know why? The other sources I’ve read all implement it differently from PyTorch. Thanks!
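
For concreteness, here is a minimal sketch of the two update rules as I understand them from the formulas in the optim.SGD note. It is plain Python on a single scalar parameter, not the actual PyTorch code, and the function names and the toy objective f(p) = p^2 are my own assumptions. It covers plain momentum only (no Nesterov, dampening, or weight decay).

```python
def pytorch_style_step(p, v, grad, lr, mu):
    # Form given in the optim.SGD note: the velocity accumulates raw
    # gradients, and the learning rate is applied at the parameter update.
    v = mu * v + grad
    p = p - lr * v
    return p, v


def sutskever_style_step(p, v, grad, lr, mu):
    # Form attributed to Sutskever et al. in the same note: the learning
    # rate is folded into the velocity itself.
    v = mu * v + lr * grad
    p = p - v
    return p, v


def run(step, lrs, p0=1.0, mu=0.9):
    # Minimise the toy objective f(p) = p**2, whose gradient is 2 * p,
    # taking one step per learning rate in `lrs`.
    p, v = p0, 0.0
    for lr in lrs:
        p, v = step(p, v, 2.0 * p, lr, mu)
    return p


if __name__ == "__main__":
    constant = [0.1] * 5
    decayed = [0.1, 0.1, 0.01, 0.01, 0.01]
    print("constant lr:", run(pytorch_style_step, constant),
          run(sutskever_style_step, constant))
    print("decayed lr: ", run(pytorch_style_step, decayed),
          run(sutskever_style_step, decayed))
```

If I have this right, the two forms give the same trajectory while the learning rate is constant (the velocities differ only by a factor of lr), and they diverge once the learning rate changes mid-training, because the PyTorch-style velocity is rescaled by the new lr immediately while the other form still carries steps taken at the old lr.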