I’m now using ZeroRedundancyOptimizer with autocast (mixed precision), and I noticed that the updated parameters are synced across all ranks after each step. However, in autocast mode, when GradScaler finds inf/NaN gradients, the optimizer step is skipped. Could this cause a hang when one rank skips the optimizer step while the other ranks are still waiting for the parameter synchronization?
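For context, here is a minimal sketch of the setup I mean, assuming the process group is already initialized; the model, loss, and loop are placeholders, not my real code:

```python
import torch
import torch.nn as nn
from torch.distributed.optim import ZeroRedundancyOptimizer

# Placeholder model; assumes init_process_group() has already been called.
model = nn.Linear(16, 16).cuda()
optimizer = ZeroRedundancyOptimizer(
    model.parameters(), optimizer_class=torch.optim.SGD, lr=1e-3
)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):  # placeholder training loop
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(torch.randn(8, 16, device="cuda")).sum()
    scaler.scale(loss).backward()
    # If this rank's unscaled gradients contain inf/NaN, scaler.step()
    # skips the inner optimizer.step() -- and with it the parameter sync.
    scaler.step(optimizer)
    scaler.update()
```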
```python
def step(
    self,
    closure: Optional[Callable[[], float]] = None,
    **kwargs: Any,
) -> Optional[float]:
    r"""
    Performs a single optimizer step and syncs parameters across all ranks.

    Arguments:
        closure (callable): a closure that re-evaluates the model and
            returns the loss; optional for most optimizers.

    Returns:
        Optional loss depending on the underlying local optimizer.

    .. note: Any extra parameters are passed to the base optimizer as-is.
    """
    if self._overlap_with_ddp:
        logging.warning(
            "`step()` should not be included in the training loop when "
            "`overlap_with_ddp=True`"
        )
        return None

    # Perform the local optimizer step
    loss = self._local_step(closure=closure, **kwargs)

    # Sync all of the updated parameter shards across the ranks
    self._sync_params()

    return loss
```
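If the answer is yes, I imagine the ranks would have to agree on whether to skip before calling `step()`, so that the collectives inside `_sync_params()` line up. Something like this sketch is what I have in mind; `grads_are_finite` is a helper I made up for illustration, not part of the ZeroRedundancyOptimizer API:

```python
import torch
import torch.distributed as dist

def grads_are_finite(params) -> torch.Tensor:
    # Hypothetical helper: a one-element tensor holding 1.0 if every
    # local gradient is finite, else 0.0.
    flag = torch.ones(1, device="cuda")
    for p in params:
        if p.grad is not None and not torch.isfinite(p.grad).all():
            flag.zero_()
            break
    return flag

flag = grads_are_finite(model.parameters())
dist.all_reduce(flag, op=dist.ReduceOp.MIN)  # becomes 0 if ANY rank saw inf/NaN
if flag.item() == 1.0:
    # Every rank steps together, so the syncs inside step() match up.
    optimizer.step()
```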