Hi, I’m trying to run a simple distributed PyTorch job using GPU/NCCL across 2 g4dn.xlarge
nodes. The process group seems to initialize fine, but wrapping the model in DDP fails with an NCCL connection error.
Failure point:
model = DistributedDataParallel(model, device_ids=[rank], output_device=rank)
Environment:
- Torch: 1.9.0+cu111
- NCCL: 2.7.8
Logs with NCCL_DEBUG=INFO:
Rank 0:
[0m ip-172-31-0-137:14605:14673 [0] NCCL INFO Bootstrap : Using [0]ens5:172.31.0.137<0> [1]veth75fa783:fe80::14ad:64ff:fe64:b31e%veth75fa783<0>
[0m ip-172-31-0-137:14605:14673 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[0m ip-172-31-0-137:14605:14673 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
[0m ip-172-31-0-137:14605:14673 [0] NCCL INFO NET/Socket : Using [0]ens5:172.31.0.137<0> [1]veth75fa783:fe80::14ad:64ff:fe64:b31e%veth75fa783<0>
[0m ip-172-31-0-137:14605:14673 [0] NCCL INFO Using network Socket
[0m NCCL version 2.7.8+cuda11.1
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Channel 00/04 : 0 1
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Channel 01/04 : 0 1
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Channel 02/04 : 0 1
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Channel 03/04 : 0 1
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1 [2] -1/-1/-1->0->1|1->0->-1/-1/-1 [3] -1/-1/-1->0->1|1->0->-1/-1/-1
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Channel 00 : 1[1e0] -> 0[1e0] [receive] via NET/Socket/0
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Channel 00 : 0[1e0] -> 1[1e0] [send] via NET/Socket/0
Rank 1:
[0m ip-172-31-86-69:1538:1569 [0] NCCL INFO Bootstrap : Using [0]ens5:172.31.86.69<0> [1]veth0bd9843:fe80::6c37:83ff:fe11:cbd6%veth0bd9843<0>
[0m ip-172-31-86-69:1538:1569 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
[0m ip-172-31-86-69:1538:1569 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
[0m ip-172-31-86-69:1538:1569 [0] NCCL INFO NET/Socket : Using [0]ens5:172.31.86.69<0> [1]veth0bd9843:fe80::6c37:83ff:fe11:cbd6%veth0bd9843<0>
[0m ip-172-31-86-69:1538:1569 [0] NCCL INFO Using network Socket
[0m ip-172-31-86-69:1538:1572 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
[0m ip-172-31-86-69:1538:1572 [0] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1 [2] 0/-1/-1->1->-1|-1->1->0/-1/-1 [3] 0/-1/-1->1->-1|-1->1->0/-1/-1
[0m ip-172-31-86-69:1538:1572 [0] NCCL INFO Channel 00 : 0[1e0] -> 1[1e0] [receive] via NET/Socket/0
[0m ip-172-31-86-69:1538:1572 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
[0m ip-172-31-86-69:1538:1572 [0] NCCL INFO Channel 00 : 1[1e0] -> 0[1e0] [send] via NET/Socket/0
[0m ip-172-31-86-69:1538:1572 [0] NCCL INFO Channel 01 : 0[1e0] -> 1[1e0] [receive] via NET/Socket/1
[0m ip-172-31-86-69:1538:1572 [0] NCCL INFO Channel 01 : 1[1e0] -> 0[1e0] [send] via NET/Socket/1
[0m ip-172-31-86-69:1538:1572 [0] include/socket.h:403 NCCL WARN Connect to fe80::14ad:64ff:fe64:b31e%7<54043> failed : Network is unreachable
[0m ip-172-31-86-69:1538:1572 [0] NCCL INFO transport/net_socket.cc:313 -> 2
[0m ip-172-31-86-69:1538:1572 [0] NCCL INFO include/net.h:21 -> 2
[0m ip-172-31-86-69:1538:1572 [0] NCCL INFO transport/net.cc:161 -> 2
[0m ip-172-31-86-69:1538:1572 [0] NCCL INFO transport.cc:68 -> 2
[0m ip-172-31-86-69:1538:1572 [0] NCCL INFO init.cc:766 -> 2
[0m ip-172-31-86-69:1538:1572 [0] NCCL INFO init.cc:840 -> 2
[0m ip-172-31-86-69:1538:1572 [0] NCCL INFO group.cc:73 -> 2 [Async thread]
Rank 0:
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Channel 01 : 1[1e0] -> 0[1e0] [receive] via NET/Socket/1
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Channel 01 : 0[1e0] -> 1[1e0] [send] via NET/Socket/1
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
[0m ip-172-31-0-137:14605:14688 [0] include/socket.h:403 NCCL WARN Connect to 172.31.86.69<37817> failed : Connection refused
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO bootstrap.cc:95 -> 2
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO bootstrap.cc:363 -> 2
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO transport.cc:59 -> 2
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO init.cc:766 -> 2
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO init.cc:840 -> 2
[0m ip-172-31-0-137:14605:14688 [0] NCCL INFO group.cc:73 -> 2 [Async thread]
Final Output:
File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
A few other things worth mentioning:
- The script runs fine with Gloo.
- The script runs when downgrading to Torch 1.6.0 with CUDA 10.2 and NCCL 2.4.8.
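One more observation from the logs: the connect that fails on rank 1 targets fe80::14ad:64ff:fe64:b31e, which is the link-local IPv6 address of rank 0's veth75fa783 interface, not its ens5 address. Below is a minimal sketch for pulling the interfaces out of the "Using [0]... [1]..." lines and flagging link-local ones. The helper names parse_nccl_interfaces and is_link_local are made up for illustration and are not part of NCCL or PyTorch; the regex only assumes the line format seen in the logs above.

```python
import ipaddress
import re

# Matches entries such as "[1]veth75fa783:fe80::14ad:...%veth75fa783<0>"
# inside a NCCL "Bootstrap : Using" or "NET/Socket : Using" log line.
_IFACE_RE = re.compile(r"\[(\d+)\]([^:\s]+):(\S+?)<\d+>")


def parse_nccl_interfaces(line):
    """Return (index, interface name, address) tuples found in one log line."""
    return [(int(idx), name, addr) for idx, name, addr in _IFACE_RE.findall(line)]


def is_link_local(addr):
    """True if addr is link-local; any "%scope" suffix is dropped first."""
    host = addr.split("%", 1)[0]
    return ipaddress.ip_address(host).is_link_local


if __name__ == "__main__":
    # Line copied from the rank 0 log above.
    line = (
        "ip-172-31-0-137:14605:14673 [0] NCCL INFO NET/Socket : Using "
        "[0]ens5:172.31.0.137<0> "
        "[1]veth75fa783:fe80::14ad:64ff:fe64:b31e%veth75fa783<0>"
    )
    for idx, name, addr in parse_nccl_interfaces(line):
        note = "link-local, not routable between nodes" if is_link_local(addr) else "ok"
        print(f"[{idx}] {name} {addr}: {note}")
```

Untested assumption: if the veth interface is the culprit, setting the NCCL_SOCKET_IFNAME environment variable (for example to ens5) on both nodes before the process group is created should restrict NCCL to that interface.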
Does anyone have any ideas or suggestions on how to debug this? Thanks in advance!