I’m now using ZeroRedundancyOptimizer with autocast and have noticed that the updated parameters are synced across all ranks after each step. However, in autocast mode with a GradScaler, the optimizer step is skipped when inf/NaN gradients are found. Could this cause a hang when one rank skips the optimization step while the other ranks are still waiting for the synchronization?
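For context, my training loop looks roughly like this (a minimal sketch; the model, criterion, data loader, and hyperparameters are placeholders, not my actual code):

import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes dist.init_process_group() has been called and that `model`,
# `criterion`, and `loader` are defined elsewhere; `rank` is the local rank.
model = DDP(model.to(rank), device_ids=[rank])
optimizer = ZeroRedundancyOptimizer(
    model.parameters(), optimizer_class=torch.optim.SGD, lr=1e-3
)
scaler = torch.cuda.amp.GradScaler()

for inputs, targets in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = criterion(model(inputs.to(rank)), targets.to(rank))
    scaler.scale(loss).backward()
    # If inf/NaN gradients are found on this rank, scaler.step() skips the
    # inner optimizer.step(), so _sync_params() would never be reached here.
    scaler.step(optimizer)
    scaler.update()

And here is the step() implementation of ZeroRedundancyOptimizer that I’m looking at: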
def step(
    self,
    closure: Optional[Callable[[], float]] = None,
    **kwargs: Any,
) -> Optional[float]:
    r"""
    Performs a single optimizer step and syncs parameters across all ranks.

    Arguments:
        closure (callable): a closure that re-evaluates the model and
            returns the loss; optional for most optimizers.
    Returns:
        Optional loss depending on the underlying local optimizer.

    .. note: Any extra parameters are passed to the base optimizer as-is.
    """
    if self._overlap_with_ddp:
        logging.warning(
            "`step()` should not be included in the training loop when "
            "`overlap_with_ddp=True`"
        )
        return None

    # Perform the local optimizer step
    loss = self._local_step(closure=closure, **kwargs)

    # Sync all of the updated parameter shards across the ranks
    self._sync_params()
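    # ^ as far as I can tell, _sync_params() issues collective broadcasts,
    # so every rank has to reach this call or the other ranks will block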
    return loss
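If my reading is right, a contrived two-rank run along these lines should deadlock: gradients are forced to be non-finite on rank 0 only (no DDP gradient averaging here, so the ranks can disagree), which should make scaler.step() skip the inner step, and with it _sync_params(), on rank 0 while rank 1 enters the broadcast. This is a hypothetical sketch, launched with torchrun --nproc_per_node=2:

import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Two parameters so that each of the two ranks owns a non-empty shard.
params = [torch.nn.Parameter(torch.ones(8, device=rank)) for _ in range(2)]
optimizer = ZeroRedundancyOptimizer(params, optimizer_class=torch.optim.SGD, lr=0.1)
scaler = torch.cuda.amp.GradScaler()

# Rank 0 produces inf gradients; rank 1 produces finite ones.
factor = float("inf") if rank == 0 else 1.0
loss = sum((p * factor).sum() for p in params)
scaler.scale(loss).backward()
scaler.step(optimizer)  # rank 0 should skip the step; rank 1 should not
scaler.update()

Is this understanding correct, or is there something that prevents the hang in practice?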