torch.optim.SGD and hand-written SGD parameter updates are slightly different

I’m finding that, during some iterations of SGD, torch.optim.SGD yields a slightly different parameter update than a manually written update.

So far, it looks like an issue of precision to me:

I’m comparing a manual update of the form `param = param - lr * param.grad` against `optimizer.step()`.

I’m comparing the norms of the weight matrices for simplicity (np.linalg.norm)
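Roughly, this is the kind of comparison I’m running (a minimal sketch with a toy linear model and made-up learning rate/shapes, not my actual RNN):

```python
# Sketch: run one SGD step two ways on identical copies of a tiny model
# and compare the weight norms, as in the numbers reported below.
import copy
import numpy as np
import torch

torch.manual_seed(0)
lr = 0.01  # assumed value for illustration

model_opt = torch.nn.Linear(4, 4)
model_man = copy.deepcopy(model_opt)  # identical initial weights
optimizer = torch.optim.SGD(model_opt.parameters(), lr=lr)

x = torch.randn(8, 4)
y = torch.randn(8, 4)
loss_fn = torch.nn.MSELoss()

# --- update via torch.optim.SGD ---
optimizer.zero_grad()
loss_fn(model_opt(x), y).backward()
optimizer.step()

# --- hand-written update: param = param - lr * param.grad ---
loss_fn(model_man(x), y).backward()
with torch.no_grad():
    for p in model_man.parameters():
        p.copy_(p - lr * p.grad)

optim_new = {n: np.linalg.norm(p.detach().numpy()) for n, p in model_opt.named_parameters()}
manual_new = {n: np.linalg.norm(p.detach().numpy()) for n, p in model_man.named_parameters()}
print(optim_new)
print(manual_new)
```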

Here are two weight matrices which are QUITE close, but slightly different (this occurs a few steps into the 2nd epoch of SGD).
optim_new =
{'B': 0.842387760071843, 'A': 1.5769101725012014}

manual_new =
{'B': 0.842387760071843, 'A': 1.5769101725012018}

I’m finding that my manual implementation outperforms optimizer.step(), perhaps due to some unexpected numerical instability.

Any thoughts on this? Note that I’m training an RNN to fit a chaotic dynamical system, so these kinds of sensitivities accumulate and start to matter more than expected.

Note: I found the following similar thread, which seems to have suffered from a different problem (Losses not matching when training using hand-written SGD and torch.optim.SGD)

UPDATE: I’m now finding that the discrepancy comes from the difference between the multiply-then-subtract expression `param - lr * param.grad` and the in-place `add_(grad, alpha=-lr)` call, which is what optim.SGD.step() uses.
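Here is a minimal sketch of the two expressions side by side (tensor shapes and lr are made up; whether the last bits actually differ depends on the backend, e.g. whether a fused multiply-add is used):

```python
# Sketch: compare multiply-then-subtract with the fused in-place add_(..., alpha=...).
# The prints may or may not show a gap on a given machine/backend.
import torch

torch.manual_seed(0)
lr = 0.01  # assumed value
w = torch.randn(1000, 1000)
g = torch.randn(1000, 1000)

manual = w - lr * g                     # rounds lr * g to float32 first, then subtracts
fused = w.clone().add_(g, alpha=-lr)    # the add_(..., alpha=...) form used inside optim.SGD.step()

print(torch.equal(manual, fused))              # bitwise identical?
print((manual - fused).abs().max().item())     # largest element-wise gap
print(torch.norm(manual).item(), torch.norm(fused).item())
```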

SECOND UPDATE: this thread has been helpful…

Which PyTorch version are you using on which OS?
I’m not sure whether this behavior stopped being reproducible at some point.

PyTorch 1.5.0
Python 3.7.6
macOS Catalina 10.15.5

Unfortunately I don’t have access to a macOS machine, so feel free to update the linked issue with your code and some information.