@ptrblck this tutorial (Getting Started with Distributed Data Parallel — PyTorch Tutorials 2.1.1+cu121 documentation) recommends using DistributedDataParallel even when training on a single machine. So, if I want to use all GPUs, the code would change from:
net = torch.nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count())))
to
net = torch.nn.parallel.DistributedDataParallel(model, device_ids=list(range(torch.cuda.device_count())))
Right? If I am using a single node with multiple GPUs, is there anything else/subtle I should do?
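Or does the multi-process nature of DDP mean I need the full process-group setup? Here is my rough sketch of what I think that would look like on a single node, pieced together from the docs (the demo_worker/main helpers, the mp.spawn scaffolding, and the env-var defaults are my own guesses, not something the tutorial prescribes for this exact case):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def demo_worker(rank, world_size, model_fn):
    # One process per GPU; rank indexes both the process and the GPU it uses.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    model = model_fn().to(rank)
    # A single device id per process, not the full list of GPUs.
    ddp_model = DDP(model, device_ids=[rank])

    # ... build a DataLoader with a DistributedSampler and run the training loop ...

    dist.destroy_process_group()

def main(model_fn):
    world_size = torch.cuda.device_count()
    mp.spawn(demo_worker, args=(world_size, model_fn), nprocs=world_size, join=True)

Is that roughly what the switch entails, or is the one-line wrapper change above actually enough?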
Also, if DistributedDataParallel is so much better, why does the DataParallel interface still exist? Doesn't that make things more confusing for users?
Quoting the tutorial on why to use DistributedDataParallel:
Comparison between DataParallel and DistributedDataParallel

Before we dive in, let's clarify why, despite the added complexity, you would consider using DistributedDataParallel over DataParallel (Getting Started with Distributed Data Parallel — PyTorch Tutorials 2.1.1+cu121 documentation) even on a single machine:
- First, DataParallel is single-process, multi-thread, and only works on a single machine, while DistributedDataParallel is multi-process and works for both single- and multi-machine training. DataParallel is usually slower than DistributedDataParallel even on a single machine due to GIL contention across threads, per-iteration replicated model, and additional overhead introduced by scattering inputs and gathering outputs.
- Recall from the prior tutorial that if your model is too large to fit on a single GPU, you must use model parallel to split it across multiple GPUs. DistributedDataParallel works with model parallel; DataParallel does not at this time. When DDP is combined with model parallel, each DDP process would use model parallel, and all processes collectively would use data parallel.
- If your model needs to span multiple machines or if your use case does not fit into the data parallelism paradigm, please see the RPC API for more generic distributed training support.