Is all_reduce unnecessary when I use DistributedDataParallel?

Hi all,

I wonder how the params of the model get reduced when I use torch.nn.parallel.DistributedDataParallel. Looking at the code here, there is no explicit all_reduce in the training loop. The docs for this module include the following note:

Parameters are never broadcast between processes. The module performs an all-reduce step on gradients and assumes that they will be modified by the optimizer in all processes in the same way. Buffers (e.g. BatchNorm stats) are broadcast from the module in process of rank 0, to all other replicas in the system in every iteration.

But it’s unclear whether the module performs the all-reduce automatically or not. What’s the difference between broadcasting the parameters and broadcasting the gradients?
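To make the question concrete, here is roughly what I imagine doing by hand without DDP (just a sketch, assuming dist.init_process_group has already been called): broadcast the parameters once at startup, then all-reduce the gradients every iteration.

```python
import torch.distributed as dist

def broadcast_parameters(model):
    # "Broadcasting the parameters" = copying rank 0's weights to every rank,
    # typically done once so all replicas start from identical weights.
    for p in model.parameters():
        dist.broadcast(p.data, src=0)

def allreduce_gradients(model):
    # "Reducing the gradients" = after backward(), every rank averages its
    # gradients with all other ranks, so identical optimizer steps keep the
    # replicas in sync without ever re-sending the weights themselves.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
```

Is the second function essentially what DDP is doing for me behind the scenes?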

The logic is handled in the DistributedDataParallel class at https://github.com/pytorch/pytorch/blob/57e162da56f83f04ed744caba1b3819a2ca8c86a/torch/nn/parallel/distributed.py#L291-L385
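So yes, the all-reduce happens automatically inside backward(). Roughly speaking, the idea is to attach hooks to the parameters so the reduction is triggered as gradients become ready. This is only a simplified Python sketch of that idea, not the actual implementation (which uses a C++ Reducer, gradient bucketing, and overlaps communication with the backward pass):

```python
import torch.distributed as dist

def attach_grad_allreduce_hooks(module):
    # Simplified illustration of DDP's mechanism: per-parameter autograd hooks
    # that average each gradient across ranks as soon as it is produced.
    world_size = dist.get_world_size()

    def make_hook():
        def hook(grad):
            # Fired by autograd when this parameter's gradient is ready.
            dist.all_reduce(grad, op=dist.ReduceOp.SUM)
            return grad / world_size
        return hook

    for p in module.parameters():
        if p.requires_grad:
            p.register_hook(make_hook())
```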

From what I understand, in DistributedDataParallel the gradients are averaged and sent from the rank 0 process to all the other devices (rank 1, rank 2, …). Consider a case where 3 GPUs are used: the rank 0 process completes its 1st iteration at T=10ms from the start, the rank 1 process completes its 1st iteration at T=13ms, and rank 2 completes its 1st iteration at T=12ms. Will rank 0 and rank 2 then wait for rank 1 until T=13ms, so that rank 0 can do the all_reduce and broadcast the gradients to rank 1 and rank 2? Does this happen synchronously or asynchronously? Is there a detailed explanation anywhere? Thanks in advance.
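To make my question concrete, this is the kind of experiment I have in mind for checking whether the faster ranks block inside the collective (just a sketch; it uses the gloo backend so it runs on CPU, and the sleep times mimic the 10/13/12 ms example above):

```python
import os
import time
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Pretend each rank's 1st iteration takes a different amount of time:
    # rank 0 -> 10 ms, rank 1 -> 13 ms, rank 2 -> 12 ms.
    time.sleep([0.010, 0.013, 0.012][rank])

    grad = torch.ones(4) * rank
    start = time.perf_counter()
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # collective: no rank finishes until all ranks join
    waited_ms = (time.perf_counter() - start) * 1000
    print(f"rank {rank} spent {waited_ms:.1f} ms inside all_reduce")

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(3,), nprocs=3)
```

If I understand collectives correctly, rank 0 should report roughly 3 ms spent waiting inside all_reduce, i.e. the fast ranks do block until the slowest one arrives. Is that the right mental model for what DDP does every iteration?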