How to map processes to GPUs in DDP, and how to launch the DDP cluster?

Hi,

I have 2 EC2 machines with 4 GPUs each. That makes 8 GPUs in total.

I want to train a PyTorch DeepLab in data-parallel over those 8 cards. What should I do:

A. Launch a DDP training with 2 scripts/processes (1 per node), each using torch.nn.DataParallel to data-parallelize within the node across its 4 cards.

B. Launch a DDP training with 8 scripts/processes (1 per GPU), each executing pure DDP + PyTorch code and using only 1 GPU (leaving DDP to do the allreduces).

In both options: how do I launch the processes with torchrun/launch.py/torch.distributed? Once for the whole cluster, from some remote client? Once per node? Once per GPU?

Hey @Olivier-CR

A. Launch a DDP training with 2 scripts/processes (1 per node), each using torch.nn.DataParallel to data-parallelize within the node across its 4 cards.

DataParallel is single-machine multi-GPU. It won’t work in the multi-machine scenario. DistributedDataParallel is the appropriate feature to use.

B. Launch a DDP training with 8 scripts/processes (1 per GPU), each executing pure DDP + PyTorch code and using only 1 GPU (leaving DDP to do the allreduces).

Yep, this should work. One caveat is that you need to make sure each DDP process exclusively operates on a dedicated GPU. You can do this either by setting CUDA_VISIBLE_DEVICES to a different GPU for each process, or by using the local_rank within each process. See: Distributed communication package - torch.distributed — PyTorch master documentation
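
A minimal per-process sketch of option B (just an illustration, assuming you launch with torchrun/launch.py so that the LOCAL_RANK env var is set; the DeepLab model choice and training-loop details are placeholders):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torchvision.models.segmentation import deeplabv3_resnet50

def main():
    dist.init_process_group(backend="nccl")      # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun/launch.py
    torch.cuda.set_device(local_rank)            # pin this process to its own GPU

    model = deeplabv3_resnet50(num_classes=21).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # DDP takes care of the allreduces

    # ... build a DataLoader with a DistributedSampler and run the usual training loop ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()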

In both options: how do I launch the processes with torchrun/launch.py/torch.distributed? Once for the whole cluster, from some remote client? Once per node? Once per GPU?

If you are using torchrun (or torch.distributed.run/launch.py), you just need to do it once per machine. If you call your user script directly without run/launch, you will need to do that once per GPU.
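
For example, with your 2 nodes × 4 GPUs, a sketch would be to run the same torchrun command on each machine, changing only --node_rank (NODE0_IP, the port, and train_ddp.py are placeholders):

# on the first machine (node_rank 0)
$ torchrun --nnodes=2 --nproc_per_node=4 --node_rank=0 --master_addr=NODE0_IP --master_port=29500 train_ddp.py

# on the second machine (node_rank 1)
$ torchrun --nnodes=2 --nproc_per_node=4 --node_rank=1 --master_addr=NODE0_IP --master_port=29500 train_ddp.py

Each machine then spawns 4 local worker processes, one per GPU.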

I do recommend using TorchElastic to launch jobs, as it also provides failure recovery.

Hey @Kiuk_Chung what’s the best tutorial to get started with run/launch? Thanks!

@mrshenli, @Olivier-CR the torch.distributed.run docs aren't a tutorial, but they are a great place to start: torchrun (Elastic Launch) — PyTorch 1.10.0 documentation

For a more out-of-the-box experience, TorchX is where we are trying to make distributed job launching much easier: Distributed — PyTorch/TorchX main documentation


wow what is that TorchX thing? some other new library again :slight_smile: ?! I thought anything distributed would go to TorchElastic / torchrun now? I'm a bit confused. Or is it just a thick client to launch distributed jobs?

torchrun will launch LOCAL processes for you. To run a distributed job, it's still on you to run torchrun on each of the nodes. TorchX has builtins to launch the job for you, and in doing so sets sensible defaults so that you don't have to manually set configurations like --rdzv_backend, --rdzv_id, etc.

Try:

$ pip install torchx-nightly
$ torchx run -s local_cwd dist.ddp -j 1x4 --script YOUR_SCRIPT.py <args to script>

Where -j is of the form {nnodes}x{nproc_per_node}, so if you wanted to simulate a 2-node setup (each node running 4 procs), you'd set -j 2x4.
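
For example, to try that 2x4 layout locally (same placeholders as above):

$ torchx run -s local_cwd dist.ddp -j 2x4 --script YOUR_SCRIPT.py <args to script>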

No need to worry about the different rendezvous settings for now. Once you are ready to submit to a remote cluster (assuming you've set up Kubernetes or Slurm), you'd run

$ torchx run -s kubernetes dist.ddp -j 1x4 --image YOUR_DOCKER_IMG --script SCRIPT.py <args to script>

@Kiuk_Chung what if you want to run via TorchX on a plain EC2 cluster? Do you still have to install etcd servers? "(needs you to start a local etcd server on port 2379! and have a python-etcd library installed)" (from here)