Distributed training produces NaN outputs after resuming from a snapshot

I am training a custom model with the following hyperparameters:

  1. Learning rate: 1e-5
  2. Batch size: 8
  3. Weight decay: 1e-6

My optimizer is AdamW. Training started on 2 GPUs and ran for around 75 epochs. I then added 2 more GPUs (4 total) and resumed from the saved snapshot (containing the model checkpoint, optimizer, and lr_scheduler states), raising the LR to 1e-4 on the assumption that the larger effective batch size warranted it. After a certain number of iterations the model outputs all NaN values. I switched back to 2 GPUs with LR 1e-5, and still no luck.
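For reference, my resume logic looks roughly like the following (a minimal, self-contained sketch with a toy model; the snapshot keys and the in-place LR override mirror what I described above, not my exact training code):

```python
import io
import torch
from torch import nn, optim

# Toy stand-in for the custom model (assumption: the real model differs)
model = nn.Linear(4, 2)
optimizer = optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-6)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10)

# Snapshot saved at the end of the 2-GPU run (epoch ~75)
snapshot = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "lr_scheduler": scheduler.state_dict(),
    "epoch": 75,
}
buffer = io.BytesIO()          # in-memory file, stands in for snapshot.pt
torch.save(snapshot, buffer)
buffer.seek(0)

# Resume on 4 GPUs: restore all three states, then override the LR
snap = torch.load(buffer)
model.load_state_dict(snap["model"])
optimizer.load_state_dict(snap["optimizer"])
scheduler.load_state_dict(snap["lr_scheduler"])
for group in optimizer.param_groups:
    group["lr"] = 1e-4         # bumped because effective batch size doubled

print(optimizer.param_groups[0]["lr"])
```

Note that the override is applied directly to `optimizer.param_groups` after `load_state_dict`, since loading the optimizer state restores the old LR of 1e-5.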