DDP device_ids, world_size, and ranks in a multi-host setup

Hi Folks,

I need a bit of help. I can’t find proper documentation on how a multi-host
setup works in DDP; everything I see focuses on multi-GPU training on a single host.

If you have two hosts and each host has a single GPU, what is the correct rank
on the worker node, what device id should it pass to DDP, and what is the correct
format for that device id? (Note that torch.cuda.set_device takes an integer, while
.to() takes a string like “cuda:0”, so most of the tutorials look a bit inconsistent.)
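
To make the format question concrete, this is roughly what I mean by the two device-id formats (a minimal sketch, assuming a single local GPU):

```python
import torch

# Both of these refer to the same physical GPU on a single-GPU host:
torch.cuda.set_device(0)             # integer index form
device = torch.device("cuda:0")      # string / torch.device form

model = torch.nn.Linear(10, 10).to(device)
```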

For example, in this tutorial
https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

Notice that it passes the rank as the device id. On a single host with n GPUs that makes sense, since:
host 0, rank 0 → cuda:0
host 0, rank 1 → cuda:1 (so each process creates DDP with device_ids=[rank])
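
In other words, the single-host pattern from that tutorial boils down to something like this (a sketch, assuming one process per GPU and MASTER_ADDR/MASTER_PORT already set in the environment):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_single_host(rank, world_size):
    # On a single host, the global rank and the local GPU index coincide.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = torch.nn.Linear(10, 10).to(f"cuda:{rank}")
    ddp_model = DDP(model, device_ids=[rank])
    return ddp_model
```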

But if you have two hosts, my understanding is that world_size is a count, so it starts from 1, not 0 (unlike ranks).
So two hosts with a single GPU each means world_size = 2.

Hence on the master node DDP gets device_ids=[0] and output_device=0.

(In all the PyTorch docs they pass the rank as the device id, but on the worker that rank is 1, which is not a valid device index there.)
On the worker node (rank 1), DDP must still be initialized with device_ids=[0] and output_device=0.

So, on the worker node:
a) when the model is created, it must be moved with .to(“cuda:0”)
b) DDP must be created with device_ids=[0]
(see the sketch below)
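
Here is a minimal sketch of what I think the worker-node setup should look like, assuming two hosts with one GPU each, the NCCL backend, and MASTER_ADDR/MASTER_PORT already exported. The point is that the global rank and the local device index are kept separate:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(global_rank, world_size):
    # global_rank: 0 on the master host, 1 on the worker host
    # world_size:  2 (two hosts x one GPU each)
    dist.init_process_group("nccl", rank=global_rank, world_size=world_size)

    # Each host has exactly one GPU, so the local device is always cuda:0,
    # regardless of the global rank.
    local_device = 0
    torch.cuda.set_device(local_device)

    model = torch.nn.Linear(10, 10).to(f"cuda:{local_device}")
    ddp_model = DDP(model, device_ids=[local_device], output_device=local_device)
    return ddp_model
```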

Right now I’m getting a strange CUDA error when the training loop starts:
RuntimeError: CUDA error: invalid device ordinal

Thank you,