In AMP mixed-precision training, if NaNs are detected, the model parameters’ gradients become None. In this case, will the all-reduce operation be skipped, causing other ranks to wait indefinitely and eventually crash the program?
No, as answered in your cross post. AMP does not set the gradients to `None` when non-finite values show up; `scaler.step(optimizer)` just skips the `optimizer.step()` call for that iteration, and `scaler.update()` then lowers the loss scale. More importantly, DDP's gradient all-reduce is launched during `backward()`, before the scaler ever inspects the gradients, so every rank still participates in the collective. Since the all-reduce averages the gradients, any inf/NaN values end up visible on all ranks, every rank skips the step together, and no rank is left waiting.
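For reference, here is a minimal sketch of a typical DDP + AMP training step, with comments marking where the all-reduce and the inf/NaN check happen. It assumes the process group is already initialized and that `local_rank` and `num_steps` are defined elsewhere (both are placeholders here, not part of your setup):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes torch.distributed.init_process_group(...) has been called and
# `local_rank` is this process's GPU index.
model = DDP(nn.Linear(10, 10).cuda(local_rank), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler()

for _ in range(num_steps):
    data = torch.randn(8, 10, device=local_rank)
    target = torch.randn(8, 10, device=local_rank)

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = criterion(model(data), target)

    # backward() triggers DDP's gradient all-reduce on every rank,
    # regardless of whether the gradients contain inf/NaN values.
    scaler.scale(loss).backward()

    # step() unscales the gradients and, if inf/NaN values are found,
    # skips optimizer.step() for this iteration; .grad is not set to None.
    scaler.step(optimizer)

    # update() reduces the loss scale if this step was skipped.
    scaler.update()
```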