NaNs in loss temporarily go away after reloading checkpoint

After training for about 1000 iterations I start getting NaNs in my loss. To debug, I reloaded the most recent checkpoint (at iteration 900) and resumed training, and the NaNs didn't reappear until around iteration 2000. I then reloaded the checkpoint from iteration 1900 and saw NaNs again around iteration 3000. Does anyone have ideas on why this behavior could be occurring?

Your model seems to be diverging non-deterministically.
I also don't know if you are resetting e.g. the optimizer state when you reload the checkpoint, which could also delay the divergence.
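As a minimal sketch of what that means: if only the model weights are saved, resuming re-initializes the optimizer's running statistics (e.g. Adam's moment estimates), which can change the trajectory for a while before it diverges again. The names below (`save_checkpoint`, `load_checkpoint`, the toy model) are just placeholders for illustration, not your actual training code; the point is that both state dicts and the iteration counter get saved and restored together.

```python
import torch

# Toy model/optimizer purely for illustration; any nn.Module and
# torch.optim optimizer work the same way.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def save_checkpoint(path, iteration):
    # Save the optimizer state (running moments, step counts) alongside
    # the model weights and the current iteration.
    torch.save({
        "iteration": iteration,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(path):
    # Restore both states so resumed training follows the same trajectory
    # as the original run instead of a "fresh optimizer" one.
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["iteration"]
```

To localize where the NaNs first appear, you could also run a few iterations with `torch.autograd.set_detect_anomaly(True)` or check `torch.isnan(loss).any()` before the backward pass, which is slower but points at the offending operation.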