torch.optim.SGD and hand-written SGD parameter updates are slightly different

I’m finding that during some iterations of SGD, torch.optim.SGD yields a slightly different parameter update than a manually written update.

So far, it looks like a floating-point precision issue to me:

I’m comparing:
val.data - lr * val.grad.data
vs. val.data after calling optimizer.step()

For simplicity, I’m comparing the norms of the weight matrices (via np.linalg.norm).
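Roughly, the comparison looks like this (a toy sketch with made-up shapes, not my actual training loop; A and B stand in for my weight matrices):

```python
import numpy as np
import torch

lr = 0.01
params = {'A': torch.randn(4, 4, requires_grad=True),
          'B': torch.randn(4, 4, requires_grad=True)}

# Dummy loss just to populate the gradients.
loss = (params['A'] @ params['B']).pow(2).sum()
loss.backward()

# Manual update, out of place.
manual_new = {name: np.linalg.norm((p.data - lr * p.grad.data).numpy())
              for name, p in params.items()}

# optimizer.step() applied to an identical copy of the parameters and gradients.
params_opt = {name: p.detach().clone().requires_grad_(True)
              for name, p in params.items()}
for name, p in params_opt.items():
    p.grad = params[name].grad.clone()
torch.optim.SGD(params_opt.values(), lr=lr).step()
optim_new = {name: np.linalg.norm(p.data.numpy())
             for name, p in params_opt.items()}

print(manual_new)
print(optim_new)  # occasionally differs from manual_new in the last decimal digit
```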

Here are the norms of two weight matrices which are QUITE close, but slightly different (this occurs a few steps into the 2nd epoch of SGD):
optim_new =
{'B': 0.842387760071843, 'A': 1.5769101725012014}

manual_new =
{'B': 0.842387760071843, 'A': 1.5769101725012018}

I’m finding that my manual implementation outperforms optimizer.step(), perhaps due to some unexpected numerical instability.

Any thoughts on this? Note that I’m training an RNN to fit a chaotic dynamical system, so these kinds of sensitivities accumulate and start to matter more than expected.

Note: I found the following similar thread, which seems to have suffered from a different problem (Losses not matching when training using hand-written SGD and torch.optim.SGD)

UPDATE: I’m now finding that the discrepancy comes from the difference between:
val.data - lr * val.grad.data
and
val.data.add_(val.grad, alpha=-lr)  # this is what torch.optim.SGD.step() does
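The two expressions can be compared in isolation. In the out-of-place version, lr * val.grad.data is rounded to float32 before the subtraction, while the in-place add_ with alpha applies the scale and the addition in one fused operation, so the roundings can differ in the last ulp. A toy repro sketch with made-up values (not my actual code):

```python
import torch

torch.manual_seed(0)
lr = 0.01
w = torch.randn(1000)  # stand-in weight vector (float32)
g = torch.randn(1000)  # stand-in gradient

# Out-of-place: lr * g is rounded to float32 first, then subtracted.
manual = w - lr * g

# In-place add with a scalar multiplier, as in torch.optim.SGD's step.
inplace = w.clone().add_(g, alpha=-lr)

# Any disagreement is in the last ulp; it may be exactly zero on some
# builds/backends and on the order of 1e-8 on others.
print((manual - inplace).abs().max())
```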

SECOND UPDATE: this thread has been helpful…

Which PyTorch version are you using, and on which OS?
I’m not sure whether this behavior stopped being reproducible at some point.

PyTorch 1.5.0
Python 3.7.6
macOS Catalina 10.15.5

Unfortunately, I don’t have access to a macOS machine, so feel free to update the linked issue with your code and some information about your setup.