What is the behavior of using DistributedDataParallel without feeding the training dataset through a DistributedSampler? Will it mean that the models are deployed on multiple GPUs, but they end up working on the same data? I am a bit confused about the behavior. Would be good to have some clarification. Thanks
Yep, if you don’t use DistributedSampler or manually shard the input data per process, every process will be working on the same data. In that case, every DDP instance in each process computes the same gradient in every iteration, so the local gradients and the synchronized global gradients are identical, making DDP useless.
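For reference, here is a minimal sketch of how DistributedSampler fits into a DDP training loop. The `run(rank, world_size)` entry point, the toy dataset, and the model are just placeholders, and it assumes the process group has already been initialized (e.g. via `dist.init_process_group`):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def run(rank, world_size):
    # Assumes dist.init_process_group(...) was already called for this rank.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))

    # DistributedSampler partitions the dataset so each rank sees a disjoint
    # shard; without it, every rank would iterate over all 1024 samples and
    # compute identical gradients.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = DDP(torch.nn.Linear(10, 1).to(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        # Reshuffle the shards each epoch so ranks see different samples over time.
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.to(rank), y.to(rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # DDP all-reduces gradients across ranks here
            optimizer.step()
```

With the sampler in place, each rank contributes a gradient computed on a different shard, so the all-reduce actually averages over distinct data rather than duplicating the same gradient.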