Hi,
what is the difference between
model = nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
and
model = nn.DataParallel(model, device_ids=[args.gpu])
?
DistributedDataParallel is multi-process parallelism, where the processes can live on different machines. So model = nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu]) creates one DDP instance in one process; there may be other DDP instances from other processes in the same group working together with this one. Check out https://pytorch.org/docs/master/notes/ddp.html
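To make the "process group" part concrete, here is a minimal single-process sketch of the DDP setup flow (world_size=1, CPU with the gloo backend, file-based rendezvous). It only illustrates the extra setup DDP needs; real multi-GPU or multi-machine training would launch one such process per rank (e.g. with torchrun) and typically use the nccl backend with device_ids=[rank].

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Every process must join the same process group before constructing DDP.
# A shared file is one supported rendezvous mechanism; TCP is another.
init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
dist.init_process_group(
    backend="gloo",                      # gloo runs on CPU; nccl is the usual choice for GPUs
    init_method=f"file://{init_file}",
    rank=0,
    world_size=1,
)

model = nn.Linear(10, 5)
ddp_model = DDP(model)                   # on GPU you would pass device_ids=[rank]

out = ddp_model(torch.randn(4, 10))
out.sum().backward()                     # gradients are all-reduced across ranks here

dist.destroy_process_group()
```

With world_size=1 this behaves like the plain model; the point is that the init_process_group call (and launching one process per rank) is the boilerplate DDP requires and DataParallel does not.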
DataParallel is single-process, multi-thread parallelism. It's basically a wrapper around scatter + parallel_apply + gather. For model = nn.DataParallel(model, device_ids=[args.gpu]), since it only runs on a single device, it's the same as just using the original model on the GPU with id args.gpu. See https://github.com/pytorch/pytorch/blob/df8d6eeb19423848b20cd727bc4a728337b73829/torch/nn/parallel/data_parallel.py#L153
DataParallel is easier to use, as you don't need additional code to set up process groups, and a one-line change is usually sufficient to enable it.
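That "one-line change" looks roughly like this (a sketch; with a single device id, as in the question, the behavior is the same as running the plain model on that GPU, and with no CUDA devices visible it simply falls back to the wrapped module):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 5)

# The one-line change: wrap the existing model. Pass device_ids=[...] to
# restrict which GPUs are used; by default all visible GPUs are used.
model = nn.DataParallel(model)

out = model(torch.randn(4, 10))          # batch is split across GPUs, outputs gathered
```

No process groups, no launcher script; the trade-off is the per-iteration overhead described below.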
DistributedDataParallel is faster and more scalable. If you have multiple GPUs or machines and care about training speed, DistributedDataParallel should be the way to go.
But DataParallel also enables multiple GPUs on one node/machine, right?
Yes, but DataParallel cannot scale beyond one machine. It is slower than DistributedDataParallel even on a single machine with multiple GPUs, because of GIL contention across the threads and the extra overhead of scatter, gather, and per-iteration model replication.
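A hypothetical pure-Python sketch of that per-iteration cycle may help (names are illustrative, not PyTorch internals): on every forward call, DataParallel replicates the model, scatters the batch across devices, runs the replicas in threads that all contend for the same GIL, then gathers the outputs.

```python
import copy
import threading

def scatter(batch, n):
    """Split the batch into n roughly equal chunks, one per device."""
    k = (len(batch) + n - 1) // n
    return [batch[i * k:(i + 1) * k] for i in range(n)]

def parallel_apply(replicas, chunks):
    """Run each replica on its chunk in a separate thread (GIL-bound)."""
    results = [None] * len(replicas)

    def worker(i):
        results[i] = [replicas[i](x) for x in chunks[i]]

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(len(replicas))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

def gather(results):
    """Concatenate the per-device outputs back into one batch."""
    return [y for chunk in results for y in chunk]

class Model:
    def __init__(self):
        self.w = 2.0

    def __call__(self, x):
        return self.w * x

model = Model()
n_devices = 2
batch = [1.0, 2.0, 3.0, 4.0]

# All of this repeats on *every* training iteration:
replicas = [copy.deepcopy(model) for _ in range(n_devices)]  # per-iteration replication
outputs = gather(parallel_apply(replicas, scatter(batch, n_devices)))
print(outputs)  # [2.0, 4.0, 6.0, 8.0]
```

DDP avoids this cycle: each process keeps its own long-lived model replica and only synchronizes gradients (via all-reduce) during backward, so there is no per-iteration replication, scatter, or gather of the model and outputs.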
Could you please elaborate on this scatter, gather, and per-iteration model replication process of DP versus DDP?