I’m finding that during some iterations of SGD, torch.optim.SGD yields a slightly different parameter update than a manually written update.

So far, it looks like an issue of precision to me:

I’m comparing: val.data - lr*val.grad.data
vs optimizer.step(); val.data

I’m comparing the norms of the weight matrices for simplicity (np.linalg.norm)

Here are two weight matrices which are QUITE close, but slightly different (this occurs a few steps into the 2nd epoch of SGD).
optim_new =
{'B': 0.842387760071843, 'A': 1.5769101725012014}

I’m finding that my manual implementation outperforms optim.step, perhaps due to some unexpected numerical instability.

Any thoughts on this? Note that I’m training an RNN to fit a chaotic dynamical system, so these kinds of sensitivities accumulate and start to matter more than expected.

UPDATE: I’m now finding that the discrepancy comes from the difference between:
val.data - lr*val.grad.data
and
val.data.add_(val.grad, alpha=-lr) # this is used by optim.SGD.step()

Hi @ptrblck, I see that this thread is dead, but I noticed exactly the same thing: there are slight differences between a handwritten SGD update and PyTorch’s SGD. Is there any mechanism in the optimizer for preventing numerical errors that I’m not aware of?
Here is a snippet of the code to reproduce the issue.

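import torch
import copy
h_sgd = torch.rand([100, 100])  # parameter being optimized
w = torch.rand([100, 10])  # fixed projection to 10 class logits
lr = 1000
h_sgd.requires_grad = True
optim = torch.optim.SGD([h_sgd], lr=lr)
out = h_sgd @ w
loss = torch.nn.CrossEntropyLoss()(out, torch.ones([100]).long())  # every target is class 1
loss.backward()
# Handwritten update, computed before optim.step() modifies h_sgd in place
h_handwritten = copy.deepcopy((h_sgd - lr * h_sgd.grad).detach())
optim.step()
# Spread of the element-wise difference between the two updates
print((h_handwritten - h_sgd).std())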

I’m using PyTorch 1.12.1+cu113 on Ubuntu 18.04.6 LTS (Bionic Beaver). The difference seems to be there on both CPU and GPU.

which also fits.
These errors are usually caused by a different order of operations combined with the limited floating-point precision. Take a look at Wikipedia - Single-precision floating-point format for more general information.
Also note that neither of these values is the “true” value, so you cannot claim that your custom implementation is the correct one while the built-in PyTorch method diverges. Both should show an initial error relative to the “exact” values (which are not representable in the numerical format being used) and could later diverge entirely.
I would expect the noise to be random, and I haven’t seen any custom implementation that still showed a benefit when it was executed multiple times.
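The point about order of operations and limited precision can be illustrated without PyTorch. The following is only a sketch: f32 is a hypothetical helper that emulates float32 rounding with the standard library, and the "one_step" variant (product kept in float64 before a single rounding) is an assumption standing in for a fused update, not a description of what optim.SGD does internally.

```python
import random
import struct


def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# 1) Float32 addition is not associative: the grouping changes the result.
a, b, c = f32(1e8), f32(-1e8), f32(1.0)
left = f32(f32(a + b) + c)    # (a + b) + c
right = f32(a + f32(b + c))   # a + (b + c); b + c rounds back to -1e8
print(left, right)            # prints 1.0 0.0

# 2) Rounding the product lr * g before the subtraction (two roundings)
#    versus rounding only the final result (one rounding).
random.seed(0)
lr = f32(0.1)
n_trials = 10_000
mismatches = 0
max_gap = 0.0
for _ in range(n_trials):
    p = f32(random.random())            # emulated float32 parameter
    g = f32(random.uniform(-1.0, 1.0))  # emulated float32 gradient
    two_step = f32(p - f32(lr * g))     # product rounded to float32 first
    one_step = f32(p - lr * g)          # product kept in float64
    if two_step != one_step:
        mismatches += 1
        max_gap = max(max_gap, abs(two_step - one_step))

print(f"{mismatches} of {n_trials} updates differ, largest gap {max_gap:.3e}")
```

Neither variant is the "exact" value. They are two nearby float32 roundings of the same real number, which is why the gap stays at the level of the last bit on any single step.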

Thank you for the answer. Of course, I don’t expect either way to be better than the other. In fact, with exactly the same sequence of operations as in SGD (h_sgd = h_sgd.add_(h_sgd.grad, alpha=-lr)), I get the same values. Thanks for the clarification!
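For anyone who wants to check this locally, here is a minimal sketch, assuming plain SGD (no momentum or weight decay) with a float lr on CPU. The names h0, g, out_of_place and in_place are illustrative and not from the posts above. The in-place update is applied to a detached copy, because calling add_ directly on a leaf tensor that requires grad raises an error outside torch.no_grad().

```python
import torch

torch.manual_seed(0)
lr = 1000.0
w = torch.rand(100, 10)
target = torch.ones(100, dtype=torch.long)

h_sgd = torch.rand(100, 100, requires_grad=True)
optim = torch.optim.SGD([h_sgd], lr=lr)
loss = torch.nn.functional.cross_entropy(h_sgd @ w, target)
loss.backward()

# Snapshot the parameter and gradient before the optimizer changes them.
h0 = h_sgd.detach().clone()
g = h_sgd.grad.detach().clone()

out_of_place = h0 - lr * g                # handwritten: multiply, then subtract
in_place = h0.clone().add_(g, alpha=-lr)  # same call shape as quoted in the thread

optim.step()
stepped = h_sgd.detach()

# Largest element-wise gap between each manual variant and optim.step()
gap_out_of_place = (out_of_place - stepped).abs().max().item()
gap_in_place = (in_place - stepped).abs().max().item()
print("out-of-place vs optim.step():", gap_out_of_place)
print("in-place add_ vs optim.step():", gap_in_place)
```

The thread reports identical values for the add_ variant on PyTorch 1.12.1. Whether the gap is exactly zero on other versions or devices is not something this sketch guarantees.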