Asynchronous Allreduce of gradients

Hello, I want to know how MPI_Allreduce works in asynchronous mode when gradients are calculated. Suppose we have 3 processes. If the first epoch has finished and only one process has updated its gradients, when it takes the gradients from a shared buffer, does it get NaN in the SUM for the processes that haven't finished? I'm pretty lost here, because Allreduce is a blocking primitive, yet the training doesn't seem to stop for it.

What do you mean by “training doesn’t stop”?

Also, how do you run allreduce in asynchronous mode? The gradient synchronization in torch.nn.parallel.DistributedDataParallel happens implicitly, when autograd computes the gradients for your model parameters. The backward pass doesn't return until all the allreduce calls have finished (or, in the case of CUDA tensors, until all the NCCL allreduce kernels have been queued).
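
Here is a minimal sketch of both points, assuming the process group has already been initialized (e.g. via torchrun and dist.init_process_group); the train_step and manual_async_allreduce functions are just illustrative names, not part of any PyTorch API:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step(model: DDP, data: torch.Tensor, target: torch.Tensor):
    loss = nn.MSELoss()(model(data), target)

    # DDP hooks into autograd: as gradient buckets become ready during
    # backward(), it launches allreduce on them. By the time backward()
    # returns, each parameter's .grad already holds the gradient averaged
    # across all ranks (for CUDA tensors, the NCCL kernels have at least
    # been queued on the stream).
    loss.backward()

    # No rank can reach this point with a "partial" sum: allreduce is a
    # collective, so every rank contributes its gradient and every rank
    # receives the same reduced result. A slow rank makes the others
    # wait; it never leaves a NaN in the sum.
    return loss

def manual_async_allreduce(grad: torch.Tensor):
    # torch.distributed does expose an asynchronous variant via
    # async_op=True, which returns a work handle instead of blocking.
    work = dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True)
    # ... overlap other computation here ...
    work.wait()  # the reduced result is only valid after wait()
    grad /= dist.get_world_size()
```

So even in the "asynchronous" case, you still have to wait on the handle before reading the buffer; the asynchrony only lets you overlap the communication with other work, it never lets a rank read a half-reduced result.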