How does PyTorch DistributedDataParallel synchronize gradient calculation between different nodes?

I know that PyTorch DistributedDataParallel uses the torch.distributed.Reducer class to ensure that the different GPUs within one node finish their gradient calculations before the ring all-reduce process starts. But how is it ensured that the other nodes have also finished their gradient calculations? For example, suppose there are three nodes. The first node finishes its gradient calculation and is ready to start the all-reduce. How does the first node know whether the other nodes have finished their gradient calculations?
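
For context, here is a minimal sketch of the kind of multi-node setup I am asking about (assuming it is launched with torchrun and the nccl backend; the toy model, tensor shapes, and hyperparameters are just placeholders):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # Each process (one per GPU, across all nodes) joins the same process group.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; DDP broadcasts its parameters from rank 0 at construction.
    model = torch.nn.Linear(10, 10).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    inputs = torch.randn(20, 10).cuda(local_rank)
    targets = torch.randn(20, 10).cuda(local_rank)

    optimizer.zero_grad()
    outputs = ddp_model(inputs)
    loss = loss_fn(outputs, targets)
    # backward() is where the Reducer kicks off all-reduce on gradient buckets.
    # My question is about this step: how the collective is coordinated so that
    # ranks on other nodes that are still computing gradients are waited for.
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```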