Hi Folks,
Does anyone know the logic of how the device id should be passed to DDP? All the examples I can find are multi-GPU on a single host; in my case it is two hosts, each with a single GPU.
I keep hitting the following error:
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/_functions.py", line 123, in _get_stream
if _streams[device] is None:
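If I read _functions.py correctly, _streams is just a per-local-GPU list, so indexing it with anything other than a valid local CUDA ordinal blows up. A simplified sketch of my understanding (not the actual torch source):

import torch

_streams = [None] * torch.cuda.device_count()  # one slot per *local* GPU, so length 1 on each of my hosts
device = 1                                     # e.g. if the global rank leaked in as the device index
if _streams[device] is None:                   # -> IndexError: list index out of range
    pass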
Here is how the master and worker are initialized.
Master
world_size is 2 since there are two hosts, each with a single GPU.
torch.distributed.init_process_group(
    backend=self.trainer_spec.get_backend(),
    init_method=self.trainer_spec.dist_url(),
    world_size=2,
    rank=0)
Worker
torch.distributed.init_process_group(
    backend=self.trainer_spec.get_backend(),
    init_method=self.trainer_spec.dist_url(),
    world_size=2,
    rank=1)
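For reference, both hosts run the same code and only the rank differs. A minimal sketch of the same initialization with the rank and world size taken from environment variables (RANK, WORLD_SIZE, and the address/port here are just conventional placeholders, not part of my trainer spec):

import os
import torch.distributed as dist

rank = int(os.environ["RANK"])              # 0 on the master host, 1 on the worker host
world_size = int(os.environ["WORLD_SIZE"])  # 2: two hosts, one GPU each

dist.init_process_group(
    backend="nccl",                          # same backend on both hosts
    init_method="tcp://<master-ip>:23456",   # placeholder address/port of the master
    world_size=world_size,
    rank=rank)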
master
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DDP(model, device_ids=[0], output_device=0).to(dev)
worker
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DDP(model, device_ids=[0], output_device=0).to(dev)
The model is on the GPU.
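My understanding, and this is exactly the assumption I would like someone to confirm, is that device_ids refers to the local CUDA ordinal, not the global rank, and that the model should already live on that device before it is wrapped. Something like this on both hosts:

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

local_device = 0                          # each host has a single GPU, so the local ordinal is always 0
torch.cuda.set_device(local_device)

model = model.to(f"cuda:{local_device}")  # move parameters to the GPU before wrapping
model = DDP(model, device_ids=[local_device], output_device=local_device)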
I enabled NCCL debug.
gpu10:5174:5174 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
gpu10:5174:5174 [0] NCCL INFO NET/Socket : Using [0]eth0:172.16.80.231<0>
gpu10:5174:5174 [0] NCCL INFO Using network Socket
gpu10:5174:5204 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
gpu10:5174:5204 [0] NCCL INFO Channel 00 : 0[b000] -> 1[6000] [receive] via NET/Socket/0
gpu10:5174:5204 [0] NCCL INFO Channel 01 : 0[b000] -> 1[6000] [receive] via NET/Socket/0
gpu10:5174:5204 [0] NCCL INFO Channel 00 : 1[6000] -> 0[b000] [send] via NET/Socket/0
gpu10:5174:5204 [0] NCCL INFO Channel 01 : 1[6000] -> 0[b000] [send] via NET/Socket/0
gpu10:5174:5204 [0] NCCL INFO Connected all rings
gpu10:5174:5204 [0] NCCL INFO Connected all trees
gpu10:5174:5204 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
gpu10:5174:5204 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
gpu10:5174:5204 [0] NCCL INFO comm 0x7f1330002fc0 rank 1 nranks 2 cudaDev 0 busId 6000 - Init COMPLETE
It looks to me that distributed.py tracks devices by an internal torch ID, while DDP accepts device_ids that must be the rank, so the logic is broken somewhere in between.
I've spent a day on this so far with no result.
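In case it helps, this is the kind of sanity check I can run on each host after init_process_group succeeds (standard torch calls, nothing specific to my trainer):

import torch
import torch.distributed as dist

print("global rank    :", dist.get_rank())
print("world size     :", dist.get_world_size())
print("visible GPUs   :", torch.cuda.device_count())
print("current device :", torch.cuda.current_device())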