NCCL Network is unreachable / Connection refused when initializing DDP

Hi, I’m trying to run a simple distributed PyTorch job using GPU/NCCL across 2 g4dn.xlarge nodes. The process group seems to initialize fine, but wrapping the model in DDP fails with an NCCL connection error.

Failure point:

 model = DistributedDataParallel(model, device_ids=[rank], output_device=rank)

Environment:

  • Torch: 1.9.0+cu111
  • NCCL: 2.7.8

Logs with NCCL_DEBUG=INFO:

Rank 0:
ip-172-31-0-137:14605:14673 [0] NCCL INFO Bootstrap : Using [0]ens5:172.31.0.137<0> [1]veth75fa783:fe80::14ad:64ff:fe64:b31e%veth75fa783<0>
ip-172-31-0-137:14605:14673 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

ip-172-31-0-137:14605:14673 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
ip-172-31-0-137:14605:14673 [0] NCCL INFO NET/Socket : Using [0]ens5:172.31.0.137<0> [1]veth75fa783:fe80::14ad:64ff:fe64:b31e%veth75fa783<0>
ip-172-31-0-137:14605:14673 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
ip-172-31-0-137:14605:14688 [0] NCCL INFO Channel 00/04 :    0   1
ip-172-31-0-137:14605:14688 [0] NCCL INFO Channel 01/04 :    0   1
ip-172-31-0-137:14605:14688 [0] NCCL INFO Channel 02/04 :    0   1
ip-172-31-0-137:14605:14688 [0] NCCL INFO Channel 03/04 :    0   1
ip-172-31-0-137:14605:14688 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
ip-172-31-0-137:14605:14688 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1 [2] -1/-1/-1->0->1|1->0->-1/-1/-1 [3] -1/-1/-1->0->1|1->0->-1/-1/-1
ip-172-31-0-137:14605:14688 [0] NCCL INFO Channel 00 : 1[1e0] -> 0[1e0] [receive] via NET/Socket/0
ip-172-31-0-137:14605:14688 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
ip-172-31-0-137:14605:14688 [0] NCCL INFO Channel 00 : 0[1e0] -> 1[1e0] [send] via NET/Socket/0

Rank 1:
ip-172-31-86-69:1538:1569 [0] NCCL INFO Bootstrap : Using [0]ens5:172.31.86.69<0> [1]veth0bd9843:fe80::6c37:83ff:fe11:cbd6%veth0bd9843<0>
ip-172-31-86-69:1538:1569 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

ip-172-31-86-69:1538:1569 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
ip-172-31-86-69:1538:1569 [0] NCCL INFO NET/Socket : Using [0]ens5:172.31.86.69<0> [1]veth0bd9843:fe80::6c37:83ff:fe11:cbd6%veth0bd9843<0>
ip-172-31-86-69:1538:1569 [0] NCCL INFO Using network Socket
ip-172-31-86-69:1538:1572 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
ip-172-31-86-69:1538:1572 [0] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1 [2] 0/-1/-1->1->-1|-1->1->0/-1/-1 [3] 0/-1/-1->1->-1|-1->1->0/-1/-1
ip-172-31-86-69:1538:1572 [0] NCCL INFO Channel 00 : 0[1e0] -> 1[1e0] [receive] via NET/Socket/0
ip-172-31-86-69:1538:1572 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
ip-172-31-86-69:1538:1572 [0] NCCL INFO Channel 00 : 1[1e0] -> 0[1e0] [send] via NET/Socket/0
ip-172-31-86-69:1538:1572 [0] NCCL INFO Channel 01 : 0[1e0] -> 1[1e0] [receive] via NET/Socket/1
ip-172-31-86-69:1538:1572 [0] NCCL INFO Channel 01 : 1[1e0] -> 0[1e0] [send] via NET/Socket/1

ip-172-31-86-69:1538:1572 [0] include/socket.h:403 NCCL WARN Connect to fe80::14ad:64ff:fe64:b31e%7<54043> failed : Network is unreachable
ip-172-31-86-69:1538:1572 [0] NCCL INFO transport/net_socket.cc:313 -> 2
ip-172-31-86-69:1538:1572 [0] NCCL INFO include/net.h:21 -> 2
ip-172-31-86-69:1538:1572 [0] NCCL INFO transport/net.cc:161 -> 2
ip-172-31-86-69:1538:1572 [0] NCCL INFO transport.cc:68 -> 2
ip-172-31-86-69:1538:1572 [0] NCCL INFO init.cc:766 -> 2
ip-172-31-86-69:1538:1572 [0] NCCL INFO init.cc:840 -> 2
ip-172-31-86-69:1538:1572 [0] NCCL INFO group.cc:73 -> 2 [Async thread]

Rank 0:
ip-172-31-0-137:14605:14688 [0] NCCL INFO Channel 01 : 1[1e0] -> 0[1e0] [receive] via NET/Socket/1
ip-172-31-0-137:14605:14688 [0] NCCL INFO Channel 01 : 0[1e0] -> 1[1e0] [send] via NET/Socket/1
ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying
ip-172-31-0-137:14605:14688 [0] NCCL INFO Call to connect returned Connection refused, retrying

ip-172-31-0-137:14605:14688 [0] include/socket.h:403 NCCL WARN Connect to 172.31.86.69<37817> failed : Connection refused
ip-172-31-0-137:14605:14688 [0] NCCL INFO bootstrap.cc:95 -> 2
ip-172-31-0-137:14605:14688 [0] NCCL INFO bootstrap.cc:363 -> 2
ip-172-31-0-137:14605:14688 [0] NCCL INFO transport.cc:59 -> 2
ip-172-31-0-137:14605:14688 [0] NCCL INFO init.cc:766 -> 2
ip-172-31-0-137:14605:14688 [0] NCCL INFO init.cc:840 -> 2
ip-172-31-0-137:14605:14688 [0] NCCL INFO group.cc:73 -> 2 [Async thread]

Final Output:

  File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

A few other things worth mentioning:

  • The script runs fine with Gloo.
  • The script runs when downgrading to Torch 1.6.0 with CUDA 10.2 and NCCL 2.4.8.

Does anyone have any ideas or suggestions on how to debug this? Thanks in advance!

Hi, this might be a bug if it is working fine with an older version of torch/cuda.

Could you file an issue on the pytorch/pytorch GitHub issue tracker with a detailed repro so that we can investigate? Thank you!

I was able to get past this issue by setting os.environ["NCCL_SOCKET_IFNAME"] = "ens5". However, it’s still not clear to me why this is needed, since the same script worked on an older version, so I created an issue on GitHub: "NCCL Network is unreachable / Connection refused when initializing DDP" (pytorch/pytorch issue #68893).
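For anyone hitting the same error, here is a rough sketch of how the interface could be picked from the logs rather than hard-coded. The address in rank 1's "Network is unreachable" warning (fe80::14ad:64ff:fe64:b31e) is the one the rank 0 log lists for veth75fa783. That is a link-local IPv6 address, which is not routable between nodes, so NCCL appears to have picked the veth interface for one of its connections. The helper names below (parse_interfaces, routable_interfaces, pin_nccl_interface) are made up for illustration and are not part of PyTorch or NCCL. The regex assumes the log format shown in this thread.

```python
import ipaddress
import os
import re

# One "NET/Socket : Using" line copied from the rank 0 log in this thread.
NCCL_LINE = (
    "ip-172-31-0-137:14605:14673 [0] NCCL INFO NET/Socket : Using "
    "[0]ens5:172.31.0.137<0> "
    "[1]veth75fa783:fe80::14ad:64ff:fe64:b31e%veth75fa783<0>"
)

# Matches entries such as "[0]ens5:172.31.0.137<0>" (format assumed from the log).
IFACE_RE = re.compile(r"\[\d+\]([^:\s]+):(\S+?)<\d+>")


def parse_interfaces(line):
    """Return (interface name, ip address) pairs listed in an NCCL log line."""
    pairs = []
    for name, addr in IFACE_RE.findall(line):
        # Drop the IPv6 zone suffix ("%veth75fa783") before parsing the address.
        pairs.append((name, ipaddress.ip_address(addr.split("%")[0])))
    return pairs


def routable_interfaces(line):
    """Keep only interfaces whose address is not link-local."""
    return [name for name, addr in parse_interfaces(line) if not addr.is_link_local]


def pin_nccl_interface(line):
    """Set NCCL_SOCKET_IFNAME to the non-link-local interfaces found in the line."""
    names = routable_interfaces(line)
    if not names:
        raise ValueError("no routable interface found in log line")
    os.environ["NCCL_SOCKET_IFNAME"] = ",".join(names)
    return os.environ["NCCL_SOCKET_IFNAME"]


chosen = pin_nccl_interface(NCCL_LINE)
print(chosen)  # ens5
```

The variable needs to be in the environment of every rank before NCCL creates its communicator, so in practice it should be set before calling init_process_group and wrapping the model in DDP. According to the NCCL documentation, the value can also start with ^ to exclude interface prefixes instead (for example "^veth"), which may be more convenient when the good interface is named differently on each node.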