Here are two graphs of training my model with only 1 data sample. SGD is perfectly monotonic, while Adam has these weird oscillations. Why is this?
After creating the model, I saved `model.state_dict()`, trained with SGD, then reloaded the original state dict, and finally trained with Adam.
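For anyone who wants to reproduce this, the procedure could look roughly like the sketch below. The model, the single sample, the learning rates, and the step count are all placeholders I made up for illustration, not the ones from my actual run. One thing to watch for: `state_dict()` returns references to the live tensors, so the snapshot is deep-copied here before training.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in model and single training sample (not the real ones).
model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1))
x = torch.randn(1, 4)
y = torch.randn(1, 1)
loss_fn = nn.MSELoss()

# Deep-copy the snapshot: state_dict() holds references to the live parameters,
# so without the copy the "saved" weights would change during training.
initial_state = copy.deepcopy(model.state_dict())


def train(model, optimizer, steps=200):
    """Run `steps` updates on the single sample and record the loss before each step."""
    losses = []
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses


# 1) Train with SGD from the initial weights.
sgd_losses = train(model, torch.optim.SGD(model.parameters(), lr=0.01))

# 2) Restore the initial weights and check that the restore worked.
model.load_state_dict(initial_state)
restored_ok = all(
    torch.equal(p, initial_state[k]) for k, p in model.state_dict().items()
)

# 3) Train with a fresh Adam optimizer from the same starting point.
adam_losses = train(model, torch.optim.Adam(model.parameters(), lr=0.01))
```

Plotting `sgd_losses` and `adam_losses` side by side gives the kind of comparison shown in the graphs. Creating the Adam optimizer only after the reload keeps its moment estimates from carrying over anything from the SGD run.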
For reference, here are the results with 100 samples (batch_size = 1):
Prompted by this: Training with batch_size = 1, all outputs are the same and trains poorly