Hi,
what is the difference between
model = nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
and
model = nn.DataParallel(model, device_ids=[args.gpu])
?
DistributedDataParallel is multi-process parallelism, where the processes can live on different machines. So model = nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu]) creates one DDP instance in one process; there may be other DDP instances from other processes in the same group working together with this one. Check out https://pytorch.org/docs/master/notes/ddp.html
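To make the "process group" part concrete, here is a minimal single-process sketch of the DDP setup flow (world_size=1, CPU with the gloo backend, file-based rendezvous). It only illustrates the extra setup DDP needs; real multi-GPU or multi-machine training would launch one such process per rank (e.g. with torchrun) and typically use the nccl backend with device_ids=[rank].

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Every process must join the same process group before constructing DDP.
# A shared file is one supported rendezvous mechanism; TCP is another.
init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
dist.init_process_group(
    backend="gloo",                      # gloo runs on CPU; nccl is the usual choice for GPUs
    init_method=f"file://{init_file}",
    rank=0,
    world_size=1,
)

model = nn.Linear(10, 5)
ddp_model = DDP(model)                   # on GPU you would pass device_ids=[rank]

out = ddp_model(torch.randn(4, 10))
out.sum().backward()                     # gradients are all-reduced across ranks here

dist.destroy_process_group()
```

With world_size=1 this behaves like the plain model; the point is that the init_process_group call (and launching one process per rank) is the boilerplate DDP requires and DataParallel does not.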
DataParallel is single-process, multi-thread parallelism. It's basically a wrapper around scatter + parallel_apply + gather. For model = nn.DataParallel(model, device_ids=[args.gpu]), since it only runs on a single device, it's the same as just using the original model on the GPU with id args.gpu. See https://github.com/pytorch/pytorch/blob/df8d6eeb19423848b20cd727bc4a728337b73829/torch/nn/parallel/data_parallel.py#L153
DataParallel is easier to use, as you don't need additional code to set up process groups, and a one-line change is usually sufficient to enable it.
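That "one-line change" looks roughly like this (a sketch; with a single device id, as in the question, the behavior is the same as running the plain model on that GPU, and with no CUDA devices visible it simply falls back to the wrapped module):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 5)

# The one-line change: wrap the existing model. Pass device_ids=[...] to
# restrict which GPUs are used; by default all visible GPUs are used.
model = nn.DataParallel(model)

out = model(torch.randn(4, 10))          # batch is split across GPUs, outputs gathered
```

No process groups, no launcher script; the trade-off is the per-iteration overhead described below.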
DistributedDataParallel is faster and more scalable. If you have multiple GPUs or machines and care about training speed, DistributedDataParallel should be the way to go.
But DataParallel also enables multiple GPUs on one node/machine, right?
Yes, but DataParallel cannot scale beyond one machine. It is slower than DistributedDataParallel even on a single machine with multiple GPUs, because of GIL contention across the threads and the extra overhead of scatter, gather, and per-iteration model replication.
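A hypothetical pure-Python sketch of that per-iteration cycle may help (names are illustrative, not PyTorch internals): on every forward call, DataParallel replicates the model, scatters the batch across devices, runs the replicas in threads that all contend for the same GIL, then gathers the outputs.

```python
import copy
import threading

def scatter(batch, n):
    """Split the batch into n roughly equal chunks, one per device."""
    k = (len(batch) + n - 1) // n
    return [batch[i * k:(i + 1) * k] for i in range(n)]

def parallel_apply(replicas, chunks):
    """Run each replica on its chunk in a separate thread (GIL-bound)."""
    results = [None] * len(replicas)

    def worker(i):
        results[i] = [replicas[i](x) for x in chunks[i]]

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(len(replicas))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

def gather(results):
    """Concatenate the per-device outputs back into one batch."""
    return [y for chunk in results for y in chunk]

class Model:
    def __init__(self):
        self.w = 2.0

    def __call__(self, x):
        return self.w * x

model = Model()
n_devices = 2
batch = [1.0, 2.0, 3.0, 4.0]

# All of this repeats on *every* training iteration:
replicas = [copy.deepcopy(model) for _ in range(n_devices)]  # per-iteration replication
outputs = gather(parallel_apply(replicas, scatter(batch, n_devices)))
print(outputs)  # [2.0, 4.0, 6.0, 8.0]
```

DDP avoids this cycle: each process keeps its own long-lived model replica and only synchronizes gradients (via all-reduce) during backward, so there is no per-iteration replication, scatter, or gather of the model and outputs.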
Could you please elaborate on this scatter, gather, and per-iteration model replication process of DP versus DDP?