torch.optim.SGD and hand-written SGD parameter updates are slightly different

I’m finding that during some iterations of SGD, torch.optim.SGD yields a slightly different parameter update than a manually written update.

So far, it looks like an issue of precision to me:

I’m comparing: val.data - lr*val.grad.data
vs. val.data after calling optimizer.step()

I’m comparing the norms of the weight matrices for simplicity (np.linalg.norm)
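
Roughly, the comparison looks like this (a minimal sketch with a placeholder linear model and loss, not my actual RNN code):

import numpy as np
import torch

# Placeholder model/loss/lr just to illustrate the comparison
model = torch.nn.Linear(4, 4)
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
lr = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

# Manual update, recorded before stepping the optimizer
manual_new = {name: np.linalg.norm((p.data - lr * p.grad.data).numpy())
              for name, p in model.named_parameters()}

optimizer.step()
optim_new = {name: np.linalg.norm(p.data.numpy())
             for name, p in model.named_parameters()}

print(manual_new)
print(optim_new)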

Here are the norms of two weight matrices which are QUITE close, but slightly different (this occurs a few steps into the 2nd epoch of SGD).
optim_new =
{'B': 0.842387760071843, 'A': 1.5769101725012014}

manual_new =
{'B': 0.842387760071843, 'A': 1.5769101725012018}

I’m finding that my manual implementation outperforms optim.step, perhaps due to some unexpected numerical instability.

Any thoughts on this? Note that I’m training an RNN to fit a chaotic dynamical system, so these kinds of sensitivities accumulate and start to matter more than expected.

Note: I found the following similar thread, which seems to have suffered from a different problem (Losses not matching when training using hand-written SGD and torch.optim.SGD)

UPDATE: I’m now finding that the discrepancy comes from the difference between:
val.data - lr*val.grad.data
and
val.data.add_(val.grad, alpha=-lr) # this is used by optim.SGD.step()
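
A minimal standalone demo of the two forms (placeholder tensors and learning rate, not my training code):

import torch

torch.manual_seed(0)
p = torch.rand(1000)
g = torch.rand(1000)
lr = 1000.0

out_of_place = p - lr * g                # my manual update
in_place = p.clone().add_(g, alpha=-lr)  # the op used inside optim.SGD.step()

# The two results can differ by a few float32 ulps, depending on how the kernels
# round the intermediate multiply
print((out_of_place - in_place).abs().max())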

SECOND UPDATE: this thread has been helpful…

Which PyTorch version are you using, and on which OS?
I’m not sure whether this behavior stopped being reproducible at some point.

PyTorch 1.5.0
Python 3.7.6
macOS Catalina 10.15.5

Unfortunately I don’t have access to a macOS machine, so feel free to update the linked issue with your code and some information.

Hi @ptrblck, I see that the issue is dead, but I noticed exactly the same thing. There are slight differences between handwritten SGD and PyTorch SGD. Is there any mechanism in the optimizer to prevent numerical errors that I’m not aware of?
Here is a snippet of the code to reproduce the issue.

import torch
import copy

# Compare a handwritten SGD update against torch.optim.SGD
h_sgd = torch.rand([100, 100])
w = torch.rand([100, 10])
lr = 1000
h_sgd.requires_grad = True
optim = torch.optim.SGD([h_sgd], lr=lr)

out = h_sgd @ w
loss = torch.nn.CrossEntropyLoss()(out, torch.ones([100]).long())
loss.backward()

# Handwritten update, computed before the optimizer modifies h_sgd in place
h_handwritten = copy.deepcopy((h_sgd - lr * h_sgd.grad).detach())
optim.step()
print((h_handwritten - h_sgd).std())

I’m using PyTorch 1.12.1+cu113 on Ubuntu 18.04.6 LTS (Bionic Beaver). The difference seems to be there on both CPU and GPU.

Relative errors in the range ~1e-6 are expected for float32 and in your case you are seeing:

print((h_handwritten - h_sgd).abs().max())
# tensor(4.7684e-07, grad_fn=<MaxBackward1>)

which also fits.
These errors are usually caused by a different order of operations combined with the limited floating-point precision. Take a look at Wikipedia - Single-precision floating-point format for more general information.
Also note that neither of these values is the “true” value, and you cannot claim your custom implementation is the correct one while the built-in PyTorch method diverges. Both show an initial error relative to the “exact” values (which are not representable in the used numerical format) and could later diverge entirely.
I would expect the noise to be random, and I haven’t seen any custom implementation that showed a benefit when executed multiple times.
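
A quick illustration of the order-of-operations point (a made-up example, not taken from your code): float32 addition is not associative, and the machine epsilon matches the expected error range:

import torch

# float32 machine epsilon: relative errors of roughly this size appear per operation
print(torch.finfo(torch.float32).eps)  # ~1.19e-07

# Reordering the same additions changes the float32 result
a = torch.tensor(1e8, dtype=torch.float32)
b = torch.tensor(-1e8, dtype=torch.float32)
c = torch.tensor(1.0, dtype=torch.float32)
print((a + b) + c)  # tensor(1.)
print(a + (b + c))  # tensor(0.) -- the 1.0 is absorbed when added to -1e8 first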

Thank you for the answer; of course I don’t expect either approach to be better than the other. In fact, with exactly the same sequence of operations as in SGD (h_sgd = h_sgd.add_(h_sgd.grad, alpha=-lr)) I get the same values. Thanks for the clarification! :slight_smile:
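
For completeness, here is a minimal sketch of that check (same setup as my snippet above, with the manual update redone via the in-place op):

import torch

torch.manual_seed(0)
h_sgd = torch.rand([100, 100], requires_grad=True)
w = torch.rand([100, 10])
lr = 1000
optim = torch.optim.SGD([h_sgd], lr=lr)

loss = torch.nn.CrossEntropyLoss()(h_sgd @ w, torch.ones([100]).long())
loss.backward()

# Same in-place update that optim.SGD.step() performs (no momentum/weight decay)
h_handwritten = h_sgd.detach().clone().add_(h_sgd.grad, alpha=-lr)
optim.step()
print(torch.equal(h_handwritten, h_sgd.detach()))  # True here -- the values match exactly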