NaN values in loss after a few epochs

I've been pretraining Swin transformers with SimMIM, using Hugging Face's Swin implementation together with a custom SimMIM implementation.
After the warmup epochs, the loss either settles at a fixed value and stays there with no sign of converging (the downstream task then produces equal predictions for all classes), or it goes to NaN. I've implemented gradient clipping and am using a small learning rate (1e-4), but neither has fixed the issue. With a larger learning rate (8e-4), the loss diverges and then settles at the fixed value mentioned above. None of this happens with ViT models, which converge without issues.
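For context, the training step follows the standard PyTorch AMP recipe, roughly like the minimal sketch below. The model, data, and loss here are dummy placeholders, not the actual Swin/SimMIM code:

```python
import torch
from torch import nn

# Placeholder stand-ins for the real SimMIM model, data, and loss.
model = nn.Linear(16, 16).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # the small LR mentioned above
scaler = torch.cuda.amp.GradScaler()

for step in range(100):
    x = torch.randn(8, 16, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = (model(x) - x).pow(2).mean()  # placeholder reconstruction loss
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale so clipping sees true gradient norms
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
```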
Can anyone please suggest how I can fix this? Thanks in advance.

Finally fixed this issue. It turns out that switching from FP16 autocasting to FP32 resolved the convergence problems.
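For anyone who hits the same thing, the change amounts to dropping the autocast context and the GradScaler so the whole step runs in FP32. Again a sketch with the same placeholder model, not the actual training code:

```python
import torch
from torch import nn

# Same placeholder setup as above, but with autocast and GradScaler removed
# so every tensor stays in FP32.
model = nn.Linear(16, 16).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(100):
    x = torch.randn(8, 16, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    loss = (model(x) - x).pow(2).mean()  # forward pass runs entirely in FP32
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```

If you still need mixed precision for memory reasons, autocasting to `torch.bfloat16` instead of `torch.float16` may also help, since bfloat16 has the same dynamic range as FP32 and is less prone to overflow.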