In AMP mixed-precision training, if NaNs are detected, the model parameters’ gradients become None. In this case, will the all-reduce operation be skipped, causing other ranks to wait indefinitely and eventually crash the program?
No, as answered in your cross post. AMP does not set the gradients to `None` when non-finite values show up; `scaler.step(optimizer)` just skips the `optimizer.step()` call for that iteration, and `scaler.update()` then lowers the loss scale. More importantly, DDP's gradient all-reduce is launched during `backward()`, before the scaler ever inspects the gradients, so every rank still participates in the collective. Since the all-reduce averages the gradients, any inf/NaN values end up visible on all ranks, every rank skips the step together, and no rank is left waiting.
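For reference, here is a minimal sketch of a typical DDP + AMP training step, with comments marking where the all-reduce and the inf/NaN check happen. It assumes the process group is already initialized and that `local_rank` and `num_steps` are defined elsewhere (both are placeholders here, not part of your setup):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes torch.distributed.init_process_group(...) has been called and
# `local_rank` is this process's GPU index.
model = DDP(nn.Linear(10, 10).cuda(local_rank), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler()

for _ in range(num_steps):
    data = torch.randn(8, 10, device=local_rank)
    target = torch.randn(8, 10, device=local_rank)

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = criterion(model(data), target)

    # backward() triggers DDP's gradient all-reduce on every rank,
    # regardless of whether the gradients contain inf/NaN values.
    scaler.scale(loss).backward()

    # step() unscales the gradients and, if inf/NaN values are found,
    # skips optimizer.step() for this iteration; .grad is not set to None.
    scaler.step(optimizer)

    # update() reduces the loss scale if this step was skipped.
    scaler.update()
```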