DDP device_ids with world size 2: two hosts, 1 GPU per host

Hi Folks,

Does anyone know the logic of how the device id should be passed to DDP? All the examples are multi-GPU on a single host; in my case it is two hosts, each with a single GPU.

I keep hitting some sort of bug:
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/_functions.py", line 123, in _get_stream
    if _streams[device] is None:

Here is how the master and the worker are initialized.

Master
world_size is 2 since there are two hosts, each with one GPU.

torch.distributed.init_process_group(
    backend=self.trainer_spec.get_backend(),
    init_method=self.trainer_spec.dist_url(),
    world_size=2,
    rank=0)

Worker
torch.distributed.init_process_group(
    backend=self.trainer_spec.get_backend(),
    init_method=self.trainer_spec.dist_url(),
    world_size=2,
    rank=1)
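
For reference, here is a minimal sketch of what I believe is the equivalent setup as a single script run on both hosts, assuming the rank comes in via a RANK environment variable and the rendezvous uses the env:// init method (MASTER_ADDR and MASTER_PORT exported on both hosts); my dist_url() based version above should behave the same:

import os
import torch.distributed as dist

# Same script on both hosts; only RANK differs (0 on the master, 1 on the worker).
rank = int(os.environ["RANK"])

dist.init_process_group(
    backend="nccl",        # gloo also works if NCCL is not an option
    init_method="env://",  # reads MASTER_ADDR / MASTER_PORT from the environment
    world_size=2,
    rank=rank,
)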

Master
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DDP(model, device_ids=[0], output_device=0).to(dev)

Worker
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DDP(model, device_ids=[0], output_device=0).to(dev)

The model is on the GPU.
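
For comparison, here is the pattern I understand from the DDP examples: move the model to the local CUDA device before wrapping it, with device_ids being the local GPU index on each host (0 here on both ranks) rather than the global rank, and with the inputs moved to the same device. A minimal sketch, with nn.Linear standing in for my model; I am not sure this is the intended usage:

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes init_process_group() has already been called as above.
dev = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 10)   # stand-in for the real model
model = model.to(dev)       # move the parameters first
if dev.type == "cuda":
    # device_ids is the local CUDA index on this host (0 here), not the global rank.
    model = DDP(model, device_ids=[0], output_device=0)
else:
    model = DDP(model)      # CPU / gloo fallback: no device_ids

# In the training loop the inputs have to live on the same device:
# batch = batch.to(dev, non_blocking=True)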

I enabled NCCL debug:
gpu10:5174:5174 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
gpu10:5174:5174 [0] NCCL INFO NET/Socket : Using [0]eth0:172.16.80.231<0>
gpu10:5174:5174 [0] NCCL INFO Using network Socket
gpu10:5174:5204 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
gpu10:5174:5204 [0] NCCL INFO Channel 00 : 0[b000] -> 1[6000] [receive] via NET/Socket/0
gpu10:5174:5204 [0] NCCL INFO Channel 01 : 0[b000] -> 1[6000] [receive] via NET/Socket/0
gpu10:5174:5204 [0] NCCL INFO Channel 00 : 1[6000] -> 0[b000] [send] via NET/Socket/0
gpu10:5174:5204 [0] NCCL INFO Channel 01 : 1[6000] -> 0[b000] [send] via NET/Socket/0
gpu10:5174:5204 [0] NCCL INFO Connected all rings
gpu10:5174:5204 [0] NCCL INFO Connected all trees
gpu10:5174:5204 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
gpu10:5174:5204 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
gpu10:5174:5204 [0] NCCL INFO comm 0x7f1330002fc0 rank 1 nranks 2 cudaDev 0 busId 6000 - Init COMPLETE

It looks to me that distributed.py tracks the device by an internal torch ID, while DDP expects a device_ids entry that matches the rank, so the logic is broken somewhere.

I have spent a day on this so far with no result.

Could you share the full stack trace and the error message you are running into? It is not clear from this stack trace what the actual error was that caused your script to fail.

Hi,

It looks like some of the variables are read only from os.environ further down the line, so somewhere torch, or maybe NCCL, reads a local rank variable.
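
If that is the case, something like the following would be the usual way to pick up the local rank, assuming the processes are launched with torchrun (which exports LOCAL_RANK); with manually launched processes the variable would have to be set by hand:

import os
import torch

# torchrun exports LOCAL_RANK for each process; default to 0 on a single-GPU host.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
dev = torch.device(f"cuda:{local_rank}")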