Comparison: DataParallel vs DistributedDataParallel

There is a comparison between DP and DDP here: PyTorch Distributed Overview — PyTorch Tutorials 2.1.1+cu121 documentation

  1. What does “reduce” mean? Does the “reduce” refer to the weight update or to loss reduction?

What’s the context here? If you mean all_reduce, it is a collective communication operation, and DDP uses it to synchronize gradients. See https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html#allreduce
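For illustration, here is a minimal sketch of what all_reduce does, assuming a process group has already been initialized (e.g. launched with torchrun using the NCCL or Gloo backend):

```python
import torch
import torch.distributed as dist

def demo_all_reduce():
    rank = dist.get_rank()
    # Each rank contributes its own tensor.
    t = torch.tensor([float(rank + 1)])
    # After all_reduce with SUM, every rank holds the same summed value.
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {t.item()}")
```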

  2. What is the major difference between DP and DDP in the weight update strategy? I think this is important.

Weight update is done by the optimizer, so if you are using the same optimizer, the weight update strategy is the same. The difference between DP and DDP is how they handle gradients. DP accumulates gradients into the same .grad field, while DDP first uses all_reduce to sum the gradients across all processes and then divides that sum by world_size to compute the mean, as sketched below. More details can be found in this paper.
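A rough sketch of that averaging step written out manually (DDP does this for you, overlapped with the backward pass; it assumes the process group is initialized and that `model` holds the same parameters on every rank):

```python
import torch.distributed as dist

def average_gradients(model):
    # Sum each parameter's gradient across all ranks, then divide by
    # world_size so every rank ends up with the mean gradient.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
```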

This difference has an impact on how the learning rate should be configured. See this discussion: Should we split batch_size according to ngpu_per_node when DistributedDataparallel

  3. Does DDP affect batch normalization (BN), or does DDP still need synchronized BN?
    Thank you for reading my question.

By default, DDP broadcasts buffers (which include BN running statistics) from rank 0 to all other ranks, so yes, it does affect BN.
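If you want BN statistics computed across all ranks rather than per-rank mini-batches, you can convert the model's BN layers to SyncBatchNorm before wrapping it in DDP. A minimal sketch, assuming a process group is already initialized and a GPU is available:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3),
    torch.nn.BatchNorm2d(16),
    torch.nn.ReLU(),
).cuda()

# Replace every BatchNorm layer with SyncBatchNorm so batch statistics
# are computed across all ranks during training.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

local_rank = torch.cuda.current_device()
# broadcast_buffers defaults to True, which syncs BN buffers from rank 0.
ddp_model = DDP(model, device_ids=[local_rank], broadcast_buffers=True)
```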

BTW, for distributed training related questions, could you please add a “distributed” tag to the post? There is an on-call team monitoring that tag.
