Why do SGD and Adam behave so differently for a dataset with 1 sample?

Here are two graphs from training my model with only 1 data sample. The SGD curve is perfectly monotonic, while the Adam curve has these weird oscillations. Why is this?

After creating the model, I saved model.state_dict(), trained with SGD, then reloaded the original state dict, and finally trained with Adam.
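In case it helps anyone reproduce this, the procedure looks roughly like the sketch below. This is not my actual model: the tiny network, the random sample, the learning rates, the step count, and the `train` helper are all placeholder assumptions. One detail that matters is that `model.state_dict()` returns tensors sharing storage with the live parameters, so the sketch deep-copies the snapshot before training.

```python
import copy

import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins: a tiny model and a single (x, y) sample.
model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1))
x = torch.randn(1, 4)
y = torch.randn(1, 1)
loss_fn = nn.MSELoss()

# state_dict() tensors share storage with the parameters, so deep-copy
# them to get a snapshot that in-place optimizer updates cannot modify.
initial_state = copy.deepcopy(model.state_dict())


def train(optimizer, steps=200):
    """Run `steps` updates on the single sample and return the loss curve."""
    losses = []
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses


# First run: plain SGD (learning rate is an arbitrary assumption).
sgd_losses = train(torch.optim.SGD(model.parameters(), lr=0.01))

# Restore the original weights so Adam starts from the same point as SGD.
model.load_state_dict(initial_state)
adam_losses = train(torch.optim.Adam(model.parameters(), lr=0.01))
```

A quick sanity check is that `sgd_losses[0]` and `adam_losses[0]` match, since both are computed before any update from the same initial weights.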

For reference, here is the result with 100 samples (batch_size = 1):


Prompted by this thread: "Training with batch_size = 1, all outputs are the same and trains poorly" (post #8 by agt)