ADAM optimizer not applying gradient correctly

I’m using the Adam optimizer for a simple SNN XOR test app. Although `loss.backward()` computes valid gradient values, `optimizer.step()` is not applying the parameter update I expected, i.e. `gradient * learning_rate`. The applied updates are much larger than the gradients themselves. On the first batch update, every parameter with a non-zero gradient moves by exactly 0.1 (the learning rate). Subsequent updates are also larger than the gradient values, and sometimes in the opposite direction. Has anyone else seen this behavior?


Adam maintains internal running state (exponential moving averages of the first and second moments of the gradient) and does not use the plain SGD update `gradient * learning_rate`. In fact, what you observe is exactly what Adam is supposed to do: on the very first step the bias-corrected moment estimates cancel out, so the update is approximately `learning_rate * sign(gradient)`, which is why every non-zero parameter moved by 0.1. Later steps can exceed the raw gradient in magnitude, and momentum can even push a parameter opposite to the current gradient. The full update rule is explained in the docs as well as the original paper.
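To make this concrete, here is a minimal scalar sketch of Adam's update rule (not PyTorch's actual implementation, just the textbook formula with the default hyperparameters), showing that the first step moves the parameter by roughly `lr * sign(grad)` regardless of the gradient's magnitude:

```python
import math

def adam_step(param, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative sketch)."""
    state["t"] += 1
    # Exponential moving averages of the gradient and its square.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized averages.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

state = {"t": 0, "m": 0.0, "v": 0.0}
p = 1.0
# First step with a tiny gradient: m_hat = g, v_hat = g**2, so the
# update is lr * g / (|g| + eps) ~= lr * sign(g), here ~0.1.
p = adam_step(p, 0.004, state)
print(p)  # ~0.9, even though the gradient was only 0.004
```

Note that the step size is effectively normalized by the gradient's recent magnitude, which is why it does not scale with the raw gradient the way plain SGD does.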