What is the behavior of using DistributedDataParallel without feeding the training dataset through a DistributedSampler? Will it mean that the models are deployed on multiple GPUs, but they end up working on the same data? I am a bit confused about the behavior. Would be good to have some clarification. Thanks
Yep, if you don’t use DistributedSampler or manually shard the input data per process, every process will be working on the same data. In that case, every DDP instance in each process computes the same gradient in every iteration, so the local gradients and the synchronized global gradients are identical, making DDP useless.
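For reference, here is a minimal sketch of how DistributedSampler fits into a DDP training loop. The `run(rank, world_size)` entry point, the toy dataset, and the model are just placeholders, and it assumes the process group has already been initialized (e.g. via `dist.init_process_group`):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def run(rank, world_size):
    # Assumes dist.init_process_group(...) was already called for this rank.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))

    # DistributedSampler partitions the dataset so each rank sees a disjoint
    # shard; without it, every rank would iterate over all 1024 samples and
    # compute identical gradients.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = DDP(torch.nn.Linear(10, 1).to(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        # Reshuffle the shards each epoch so ranks see different samples over time.
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.to(rank), y.to(rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # DDP all-reduces gradients across ranks here
            optimizer.step()
```

With the sampler in place, each rank contributes a gradient computed on a different shard, so the all-reduce actually averages over distinct data rather than duplicating the same gradient.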