Run PyTorch on Multiple GPUs

@ptrblck this tutorial (Getting Started with Distributed Data Parallel — PyTorch Tutorials 2.1.1+cu121 documentation) recommends using DistributedDataParallel even if we are on a single machine. So if I want to use all GPUs, the code would change from:

net = torch.nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count())))

to

net = torch.nn.parallel.DistributedDataParallel(model, device_ids=list(range(torch.cuda.device_count())))

right? If I am using a single node with multiple GPUs, there isn't anything else subtle I should do, right?
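For reference, here is roughly what I understand a minimal single-node DDP setup to look like. This is only a sketch, not the tutorial's exact code: `MyModel` is a placeholder for the actual model, and the master address/port values are arbitrary.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def run(rank, world_size):
    # Each spawned process joins the same process group (env:// init needs these).
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # One GPU per process: device_ids holds the single local device, not all of them.
    model = MyModel().to(rank)  # MyModel is a placeholder for your model
    ddp_model = DDP(model, device_ids=[rank])

    # ... training loop using ddp_model ...

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    # Spawn one process per GPU; each process receives its rank as the first argument.
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```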

Also, if DistributedDataParallel is so much better, why does the interface for DataParallel still exist? Doesn't that make things more confusing for users?

Quoting the tutorial on why to use DistributedDataParallel:

Comparison between DataParallel and DistributedDataParallel

Before we dive in, let’s clarify why, despite the added complexity, you would consider using DistributedDataParallel over DataParallel (Getting Started with Distributed Data Parallel — PyTorch Tutorials 2.1.1+cu121 documentation) even with a single machine:

  • First, DataParallel is single-process, multi-threaded, and only works on a single machine, while DistributedDataParallel is multi-process and works for both single- and multi-machine training. DataParallel is usually slower than DistributedDataParallel even on a single machine due to GIL contention across threads, the per-iteration replicated model, and the additional overhead introduced by scattering inputs and gathering outputs.
  • Recall from the prior tutorial that if your model is too large to fit on a single GPU, you must use model parallel to split it across multiple GPUs. DistributedDataParallel works with model parallel; DataParallel does not at this time. When DDP is combined with model parallel, each DDP process would use model parallel, and all processes collectively would use data parallel (see the sketch after this list).
  • If your model needs to span multiple machines or if your use case does not fit into the data parallelism paradigm, please see the RPC API for more generic distributed training support.
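The second bullet (combining DDP with model parallel) is easier to picture with code. Here is a rough sketch assuming two GPUs per process; the class and layer names (`TwoGpuModel`, `net0`, `net1`) are made up, not from the tutorial, and the process group is assumed to be already initialized as in the sketch above.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class TwoGpuModel(nn.Module):
    """One model replica split across two GPUs (model parallel within a process)."""

    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.net0 = nn.Linear(10, 10).to(dev0)  # first half lives on GPU dev0
        self.net1 = nn.Linear(10, 5).to(dev1)   # second half lives on GPU dev1

    def forward(self, x):
        x = torch.relu(self.net0(x.to(self.dev0)))
        return self.net1(x.to(self.dev1))


def wrap_model_parallel(rank):
    # With 2 GPUs per process: rank 0 uses GPUs 0/1, rank 1 uses GPUs 2/3, ...
    dev0, dev1 = rank * 2, rank * 2 + 1
    model = TwoGpuModel(dev0, dev1)
    # For a module that already spans multiple devices, device_ids and
    # output_device are left unset; DDP infers the devices from the parameters.
    return DDP(model)
```

So each process handles the model-parallel split across its own GPUs, and DDP provides data parallelism across the process replicas.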