When training my model, I observe that at some point the loss jumps to a relatively high value, and this happens across different training runs. The picture shows good convergence until epoch 24.
What could be the reason for this behaviour?
I am training a transformer model on a masked reconstruction task with MSE loss. I use gradient norm clipping to prevent the gradients from exploding.
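The clipping itself is the standard per-step call; a heavily simplified skeleton of the loop (dummy model, random data, placeholder hyperparameters, not my real code) looks like this:

```python
import torch
import torch.nn as nn

# simplified skeleton with a dummy model and random data
model = nn.Linear(16, 16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

for step in range(100):
    inputs = torch.randn(8, 16)    # stands in for the masked inputs
    targets = torch.randn(8, 16)   # stands in for the reconstruction targets
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    # clip the global gradient norm before the optimizer step
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```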
I am happy to provide more details about my training parameters if needed!
You probably have a bad sample in your training set: a preprocessing issue, an overflow, etc… Track the per-batch loss and inspect the batch where the spike happens.
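Something along these lines, assuming a standard PyTorch loop (the model, data and spike threshold below are placeholders):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# dummy model and data so the sketch runs; swap in your own pieces
model = nn.Linear(16, 16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()
data = torch.randn(256, 16)
dataset = TensorDataset(torch.arange(len(data)), data)   # carry sample indices with the data
loader = DataLoader(dataset, batch_size=8, shuffle=True)

SPIKE_FACTOR = 5.0   # flag a batch whose loss exceeds 5x the running mean
running, seen = 0.0, 0
for step, (indices, inputs) in enumerate(loader):
    loss = criterion(model(inputs), inputs)   # toy reconstruction: target = input
    if seen > 0 and loss.item() > SPIKE_FACTOR * running / seen:
        # dump the offending batch so it can be inspected offline
        torch.save({"step": step, "indices": indices, "loss": loss.item()},
                   f"suspicious_batch_{step}.pt")
    running += loss.item()
    seen += 1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Once you know the sample indices of the spiking batch, you can pull those samples out of the dataset and look at them directly.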
However, since this only happened after a few epochs, the model has already seen that sample several times. Why would it cause a problem later, but not earlier?
I’d say masked reconstruction typically works with random masking, so I doubt your dataloader ever presents exactly the same sample-plus-mask combination twice. But among all the possibilities (dataloader issues, training instabilities, etc…) the simplest one to check is the data. It could also be a normalization issue (division by zero), which happens rarely, but sometimes the stars align.
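On the normalization point, the usual guard is an epsilon in the denominator; per-sample standardization without it blows up exactly on those rare, (nearly) constant samples. Rough sketch:

```python
import torch

def standardize(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Per-sample standardization that cannot divide by zero.

    Without eps, a (nearly) constant sample gives std ~ 0 and the division
    produces huge or inf values, i.e. a single enormous loss spike.
    """
    mean = x.mean(dim=-1, keepdim=True)
    std = x.std(dim=-1, keepdim=True)
    return (x - mean) / (std + eps)

# a constant sample is exactly the rare case where the stars align
flat_sample = torch.full((1, 16), 3.14)
print(standardize(flat_sample))   # finite values instead of inf/nan
```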
Try to reproduce the spike a couple of times and see what is actually going wrong; I’ve put a small seed-fixing sketch below to make runs repeatable. If you find nothing strange in the data, it could be a scheduler issue, if you use a fancy one.
And something that apparently happens about once every 24 epochs, across several runs, doesn’t seem too infrequent to me.
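To make a run actually repeatable (same shuffling, same random masks), fix the RNG seeds before building the dataloader and the model; minimal sketch:

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 0) -> None:
    """Fix the common RNGs so data order and random masks repeat across runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)   # call once, before creating the DataLoader and the model
```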
Most of the time I can restart the training and the same problem doesn’t happen again. This could indicate that there is a very low probability of producing a bad mask (however defined).
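One concrete kind of “bad mask” I can imagine is one that masks (almost) no positions, so the MSE is averaged over a handful of values, or over none at all. A cheap guard (a sketch, not my actual masking code) would be:

```python
import torch

def check_mask(mask: torch.Tensor, min_masked: int = 1) -> None:
    """Guard against degenerate random masks (True = position contributes to the loss)."""
    n_masked = int(mask.sum())
    if n_masked < min_masked:
        raise RuntimeError(f"degenerate mask: only {n_masked} masked positions")

# a mask that selects nothing makes the masked MSE a mean over zero elements (NaN),
# and a near-empty mask makes the loss extremely noisy
pred = torch.randn(8, 32)
target = torch.randn(8, 32)
empty_mask = torch.zeros(8, 32, dtype=torch.bool)
print(((pred - target)[empty_mask] ** 2).mean())   # tensor(nan)
check_mask(empty_mask)                             # raises instead of silently training on NaN
```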
By “fancy scheduler” do you mean a learning rate scheduler?
Could this be a bad-data-plus-shuffling issue? In the sense that you mostly see good data first, and at some point the shuffle puts the “bad” data at the front?
Try re-running the same script with fewer data points and check whether you get the same issue.
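E.g. with a plain Subset, assuming a standard torch Dataset (names and sizes below are made up):

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# stand-in for the real dataset
full_dataset = TensorDataset(torch.randn(10_000, 16))

# keep only the first N samples for a quick, cheap repro run
N = 512
small_dataset = Subset(full_dataset, range(N))
loader = DataLoader(small_dataset, batch_size=32, shuffle=True)

print(len(full_dataset), len(small_dataset))   # 10000 512
```

If the spike disappears on the subset, that points toward a specific sample or a rare data-dependent event rather than a general training instability.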