DDP device_ids in and world size 2 host 1 GPU per host

spyroot · May 21, 2022, 5:07pm

Hi Folks,

Does anyone know the logic of how the device id should be passed to DDP? In all examples it is multi-GPU, in my case, it is two hosts,s, and each host with a single GPU.

I keep hitting some sort of bug
File “/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/_functions.py”, line 123, in _get_stream
if _streams[device] is None:

Here is how master and work are initialized.

Master
world_size 2 since its two hosts and host 1 GPU

torch.distributed.init_process_group(
backend=self.trainer_spec.get_backend(),
init_method=self.trainer_spec.dist_url(),
world_size=2
rank=0)

Worker
torch.distributed.init_process_group(
backend=self.trainer_spec.get_backend(),
init_method=self.trainer_spec.dist_url(),
world_size=2
rank=1)

master
dev = torch.device(“cuda” if torch.cuda.is_available() else “cpu”)
model = DDP(model, device_ids=[0], output_device=0).to(dev)

worker
dev = torch.device(“cuda” if torch.cuda.is_available() else “cpu”)
model = DDP(model, device_ids=[0], output_device=0).to(dev)

model on GPU.

I enabled NCLL debug.
gpu10:5174:5174 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
gpu10:5174:5174 [0] NCCL INFO NET/Socket : Using [0]eth0:172.16.80.231<0>
gpu10:5174:5174 [0] NCCL INFO Using network Socket
gpu10:5174:5204 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
gpu10:5174:5204 [0] NCCL INFO Channel 00 : 0[b000] → 1[6000] [receive] via NET/Socket/0
gpu10:5174:5204 [0] NCCL INFO Channel 01 : 0[b000] → 1[6000] [receive] via NET/Socket/0
gpu10:5174:5204 [0] NCCL INFO Channel 00 : 1[6000] → 0[b000] [send] via NET/Socket/0
gpu10:5174:5204 [0] NCCL INFO Channel 01 : 1[6000] → 0[b000] [send] via NET/Socket/0
gpu10:5174:5204 [0] NCCL INFO Connected all rings
gpu10:5174:5204 [0] NCCL INFO Connected all trees
gpu10:5174:5204 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
gpu10:5174:5204 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
gpu10:5174:5204 [0] NCCL INFO comm 0x7f1330002fc0 rank 1 nranks 2 cudaDev 0 busId 6000 - Init COMPLETE

It looks to me that distribute.py track id by internal torch ID while DDP accepts deves_id that must rank. So the logic is somewhere broken.

Spent a day so far no result.

pritamdamania87 · May 24, 2022, 5:33pm

Could you share the full stack trace and the error message you are running into? It is not clear from this stack what was the actual error that caused your script to fail.

spyroot · June 2, 2022, 5:04am

Hi,

It looks like some of the variables read only for os.env down the line. So somewhere in torch or maybe NCCL read local rank variable.

spyroot · June 2, 2022, 5:05am

github.com/pytorch/pytorch

DDP multi host with single GPU each.

opened 08:07PM - 21 May 22 UTC

spyroot

oncall: distributed triaged

### 🐛 Describe the bug Folks, I have two hosts, and each host has a single… GPU, I'm using an example. https://github.com/sudomaze/ttorch/blob/main/examples/ddp/run.py if I use the master node rank 0 , world_size 2 worker rank 1, world_size 2 if I use (master start training loop but worker never connected) rank 0 , world_size 1 rank 1 , world_size 1 Stack trace for case one. Note that master goes and waits only if world_size 2. ```gpu10:17709:17709 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1] gpu10:17709:17709 [0] NCCL INFO NET/Socket : Using [0]eth0:172.16.80.231<0> gpu10:17709:17709 [0] NCCL INFO Using network Socket gpu10:17709:17729 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 gpu10:17709:17729 [0] NCCL INFO Channel 00 : 0[b000] -> 1[6000] [receive] via NET/Socket/0 gpu10:17709:17729 [0] NCCL INFO Channel 01 : 0[b000] -> 1[6000] [receive] via NET/Socket/0 gpu10:17709:17729 [0] NCCL INFO Channel 00 : 1[6000] -> 0[b000] [send] via NET/Socket/0 gpu10:17709:17729 [0] NCCL INFO Channel 01 : 1[6000] -> 0[b000] [send] via NET/Socket/0 gpu10:17709:17729 [0] NCCL INFO Connected all rings gpu10:17709:17729 [0] NCCL INFO Connected all trees gpu10:17709:17729 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 gpu10:17709:17729 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer gpu10:17709:17729 [0] NCCL INFO comm 0x7f7f2c002fc0 rank 1 nranks 2 cudaDev 0 busId 6000 - Init COMPLETE barrier released Traceback (most recent call last): File "/root/git/dtc_latest/dtc/ddp_test/ddp_sample.worker.py", line 235, in <module> init_process(1, world_size, run) File "/root/git/dtc_latest/dtc/ddp_test/ddp_sample.worker.py", line 227, in init_process fn(rank, world_size) File "/root/git/dtc_latest/dtc/ddp_test/ddp_sample.worker.py", line 158, in run torch.cuda.set_device(rank) File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 313, in set_device torch._C._cuda_setDevice(device) RuntimeError: CUDA error: invalid device ordinal CUDA kernel errors might be asynchronously reported at some other API call, so the stack trace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. ``` Master node. ``` win11:15538:15538 [0] NCCL INFO Bootstrap : Using eth0:192.168.254.205<0> win11:15538:15538 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation win11:15538:15538 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1] win11:15538:15538 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.254.205<0> win11:15538:15538 [0] NCCL INFO Using network Socket NCCL version 2.10.3+cuda11.5 win11:15538:15568 [0] NCCL INFO Channel 00/02 : 0 1 win11:15538:15568 [0] NCCL INFO Channel 01/02 : 0 1 win11:15538:15568 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 win11:15538:15568 [0] NCCL INFO Channel 00 : 1[6000] -> 0[b000] [receive] via NET/Socket/0 win11:15538:15568 [0] NCCL INFO Channel 01 : 1[6000] -> 0[b000] [receive] via NET/Socket/0 win11:15538:15568 [0] NCCL INFO Channel 00 : 0[b000] -> 1[6000] [send] via NET/Socket/0 win11:15538:15568 [0] NCCL INFO Channel 01 : 0[b000] -> 1[6000] [send] via NET/Socket/0 win11:15538:15568 [0] NCCL INFO Connected all rings win11:15538:15568 [0] NCCL INFO Connected all trees win11:15538:15568 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512 win11:15538:15568 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer win11:15538:15568 [0] NCCL INFO comm 0x7fa13c002fb0 rank 0 nranks 2 cudaDev 0 busId b000 - Init COMPLETE win11:15538:15538 [0] NCCL INFO Launch mode Parallel releasing 2 64 win11:15538:15570 [0] include/socket.h:423 NCCL WARN Net : Connection closed by remote peer 172.16.80.231<36406> win11:15538:15570 [0] NCCL INFO transport/net_socket.cc:414 -> 2 win11:15538:15570 [0] NCCL INFO include/net.h:28 -> 2 win11:15538:15570 [0] NCCL INFO transport/net.cc:459 -> 2 win11:15538:15570 [0] NCCL INFO proxy.cc:351 -> 2 win11:15538:15570 [0] NCCL INFO proxy.cc:452 -> 2 [Proxy Thread] ``` I manage a bit narrow it down. This a case ``` master rank 0 local rank set 0 world size 2 device = "cuda:0" model.to(device) DDP(model, device_ids=[0], output_device=[0) worker device = "cuda:0" model.to(device) DDP(model, device_ids=[0], output_device=[0) ``` ``` RuntimeError: CUDA error: invalid device ordinal CUDA kernel errors might be asynchronously reported at some other API call, so the stack trace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. ``` ### Versions Collecting environment information... PyTorch version: 1.11.0+cu115 Is debug build: False CUDA used to build PyTorch: 11.5 ROCM used to build PyTorch: N/A OS: Ubuntu 22.04 LTS (x86_64) GCC version: (Ubuntu 11.2.0-19ubuntu1) 11.2.0 Clang version: Could not collect CMake version: version 3.22.1 Libc version: glibc-2.35 Python version: 3.10.4 (main, Apr 2 2022, 09:04:19) [GCC 11.2.0] (64-bit runtime) Python platform: Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.35 Is CUDA available: True CUDA runtime version: Could not collect GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090 Nvidia driver version: 512.15 cuDNN version: Could not collect HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True Versions of relevant libraries: [pip3] numpy==1.21.6 [pip3] torch==1.11.0+cu115 [pip3] torchaudio==0.11.0+cu115 [pip3] torchvision==0.12.0+cu115 [conda] Could not collect cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @kwen2501