In a multi-GPU DDP environment, if the loss on one rank is NaN while the others are normal, could this cause the all-reduce to hang?

In AMP mixed-precision training, my understanding is that when NaNs are detected the parameters' gradients become None. If that happens on one rank, will its all-reduce be skipped, leaving the other ranks waiting indefinitely and eventually crashing the program?

No, as answered in your cross post. DDP's gradient all-reduce runs during `backward()` regardless of the gradient values, and NaNs are reduced like any other float, so no rank is left waiting. With AMP, `GradScaler` detects inf/NaN gradients later, inside `scaler.step()`, and only skips the optimizer step; it does not skip the communication.
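
Here is a minimal sketch (not from the original thread) that you can run to check this yourself. It assumes 2 CPU processes with the `gloo` backend and a tiny `nn.Linear` model; the NaN injection on rank 0 is purely illustrative. It shows that `backward()` still performs the all-reduce on every rank and that the NaNs simply propagate to all ranks' gradients, rather than causing a hang.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(4, 1))
    x = torch.randn(8, 4)
    loss = model(x).sum()

    if rank == 0:
        # Poison the loss on rank 0 only, simulating a NaN on a single rank.
        loss = loss * float("nan")

    # backward() triggers DDP's bucketed all-reduce on every rank; NaN values
    # are reduced like any other float, so no rank waits indefinitely.
    loss.backward()

    grad = model.module.weight.grad
    print(f"rank {rank}: grad has NaN = {torch.isnan(grad).any().item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```

Both processes should finish and report NaN gradients, since the averaged gradient is NaN once any rank contributes one. In a real AMP run, that shared NaN is what `scaler.step()` would then catch on every rank, so all ranks skip the same optimizer step together.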