torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1168, unhandled system error, NCCL version 2.17.1

While running the DDP tutorial (Getting Started with Distributed Data Parallel — PyTorch Tutorials 2.0.1+cu117 documentation)
on two machines, I got this error:

Traceback (most recent call last):

ddp_model = DDP(model, device_ids=[device_id])
File "/home/user/miniconda3/envs/torch2/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 791, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/home/user/miniconda3/envs/torch2/lib/python3.10/site-packages/torch/distributed/utils.py", line 265, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1168, unhandled system error, NCCL version 2.17.1
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
socketStartConnect: Connect to 192.168.0.4<57117> failed : Software caused connection abort

and I don't understand the error.
Each machine is equipped with NVLink-connected GPUs, and the two machines can communicate over TCP.
I set NCCL_SOCKET_IFNAME correctly after checking ifconfig.
To run the program I used the following command:

torchrun --nnodes=2 --nproc_per_node=$NUM_GPUS --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29400 ddp_test_elastic.py

When I ping the other machine, it works fine.
Can I get some hints on resolving this issue?
Thanks!

Could you rerun your workload with NCCL_DEBUG=INFO and post the logs here, please?
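For example, something like the following, reusing your launch command ($NUM_GPUS and $MASTER_ADDR are your existing placeholders; NCCL_DEBUG_SUBSYS is optional and just narrows the output to the initialization and networking subsystems, where this failure occurs):

```shell
# Verbose NCCL logging; SUBSYS filter is optional.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

torchrun --nnodes=2 --nproc_per_node=$NUM_GPUS --rdzv_id=100 \
    --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29400 ddp_test_elastic.py
```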

Hello, this is the log I got.

mew1:387101:387101 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno8303

mew1:387101:387101 [0] NCCL INFO Bootstrap : Using eno8303:192.168.0.4<0>

mew1:387101:387101 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory

mew1:387101:387101 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation

mew1:387101:387101 [0] NCCL INFO cudaDriverVersion 12000

NCCL version 2.17.1+cuda11.7

mew1:387101:387152 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno8303

mew1:387101:387152 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB eno8303:192.168.0.4<0>

mew1:387101:387152 [0] NCCL INFO Using network IB

mew1:387101:387151 [0] bootstrap.cc:126 NCCL WARN Bootstrap Root : mismatch in rank count from procs 4 : 6

mew1:387102:387102 [1] NCCL INFO cudaDriverVersion 12000

mew1:387102:387102 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno8303

mew1:387102:387102 [1] NCCL INFO Bootstrap : Using eno8303:192.168.0.4<0>

mew1:387102:387102 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory

mew1:387102:387102 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation

mew1:387102:387160 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno8303

mew1:387103:387103 [2] NCCL INFO cudaDriverVersion 12000

mew1:387103:387103 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno8303

mew1:387103:387103 [2] NCCL INFO Bootstrap : Using eno8303:192.168.0.4<0>

mew1:387103:387103 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory

mew1:387103:387103 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation

mew1:387104:387104 [3] NCCL INFO cudaDriverVersion 12000

mew1:387104:387104 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno8303

mew1:387104:387104 [3] NCCL INFO Bootstrap : Using eno8303:192.168.0.4<0>

mew1:387104:387104 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory

mew1:387104:387104 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation

mew1:387103:387161 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno8303

mew1:387104:387162 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno8303

mew1:387103:387161 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB eno8303:192.168.0.4<0>

mew1:387103:387161 [2] NCCL INFO Using network IB

mew1:387103:387161 [2] misc/socket.cc:480 NCCL WARN socketStartConnect: Connect to 192.168.0.4<35443> failed : Software caused connection abort

mew1:387103:387161 [2] NCCL INFO misc/socket.cc:561 → 2

mew1:387103:387161 [2] NCCL INFO misc/socket.cc:615 → 2

mew1:387103:387161 [2] NCCL INFO bootstrap.cc:270 → 2

mew1:387103:387161 [2] NCCL INFO init.cc:630 → 2

mew1:387103:387161 [2] NCCL INFO init.cc:1114 → 2

mew1:387103:387161 [2] NCCL INFO group.cc:64 → 2 [Async thread]

mew1:387103:387103 [2] NCCL INFO group.cc:422 → 2

mew1:387103:387103 [2] NCCL INFO group.cc:106 → 2

mew1:387103:387103 [0] NCCL INFO comm 0x5fb6c6e0 rank 2 nranks 4 cudaDev 2 busId ca000 - Abort COMPLETE

I would appreciate any help on this. Thanks a lot!

It seems the actual connection is failing. Did you make sure the nodes can reach each other?

Is there a quick way to check?
When I checked with ping, the machines could send and receive packets properly.

Are you using WSL, and could it be a firewall/routing issue as described here?

Well, I'm using a bare-metal Linux server, and I confirmed that the Linux firewall is not running on my machine. If it were a routing problem, ping packets wouldn't be able to reach the other machine either, would they?
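One caveat: ping only exercises ICMP, while NCCL's bootstrap opens TCP connections (to ephemeral ports like the <35443> in your log), so routing or filtering can let ping through and still block those connections. A quick TCP-level check, as a sketch (the IP and port below are placeholders taken from this thread):

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        try:
            s.connect((host, port))
            return True
        except OSError:
            return False

# Placeholders from this thread: the peer IP from the error message and
# the rendezvous port from the torchrun command.
# tcp_reachable("192.168.0.4", 29400)
```

On the other machine you can start a throwaway listener first, e.g. `nc -l 29400`, so the check has something to connect to; `nc -vz 192.168.0.4 29400` is an equivalent one-liner if netcat is available.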