Hey @Trinayan_Baruah
Quoting a recent discussion: Comparison Data Parallel Distributed data parallel
Please also see this brief note and this full paper.
What is the behavior of using DistributedDataParallel without feeding the training dataset through a DistributedSampler? Does it mean that the model is replicated on multiple GPUs, but they all end up working on the same data?
Yep, if you don’t use DistributedSampler or manually shard the input data for each process, all processes will be working on the same data. In that case, every DDP instance in every process ends up with the same gradient in every iteration, so the local gradients and the synchronized global gradients are identical, making DDP useless.
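For reference, here is a minimal sketch of the intended pattern, assuming a single-node setup with one GPU per process; the `train` function, the toy dataset, and the linear model are placeholders, not part of the discussion above. The key point is that `DistributedSampler` gives each rank a disjoint shard of indices, so the gradients being all-reduced by DDP actually come from different data.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def train(rank, world_size):
    # Assumes MASTER_ADDR / MASTER_PORT are set in the environment.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Toy dataset for illustration only.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))

    # DistributedSampler partitions the indices so each rank sees a disjoint shard.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = DDP(torch.nn.Linear(10, 1).to(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        # Reshuffle differently each epoch while keeping the shards disjoint.
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.to(rank), y.to(rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # DDP all-reduces gradients across ranks here
            optimizer.step()

    dist.destroy_process_group()
```

Without the sampler (i.e. a plain `DataLoader` over the full dataset in every process), each rank would iterate over identical batches, and the all-reduced gradient would equal each local gradient.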