Connection refused

antae · June 13, 2021, 1:10pm

Hi, I ran the tutorial code from the distributed section.
Node 0 has two RTX 6000 GPUs and node 1 has one RTX 2080 Super.
Could the error be caused by the mismatch between the two nodes?
The error logs are below.
How can I fix this problem?

node 0 (master)
python -m torch.distributed.launch --nnode=2 --node_rank=0 --nproc_per_node=1 multi_gpu_pratice/getting_start.py --local_world_size=1

[281611] Initializing process group with: {'MASTER_ADDR': '163.247.44.175', 'MASTER_PORT': '7899', 'RANK': '0', 'WORLD_SIZE': '2'}
Precision-7920-Tower:281611:281611 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0>
Precision-7920-Tower:281611:281611 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

Precision-7920-Tower:281611:281611 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
Precision-7920-Tower:281611:281611 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
Precision-7920-Tower:281611:281611 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.0

node 1
python -m torch.distributed.launch --nnode=2 --node_rank=1 --nproc_per_node=1 multi_gpu_pratice/getting_start.py --local_world_size=1

[8658] world_size = 2, rank = 1, backend=nccl
[8658] rank = 1, world_size = 2, n = 1, device_ids = [0]
MS-7B23:8658:8658 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0>
MS-7B23:8658:8658 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
MS-7B23:8658:8658 [0] NCCL INFO NET/IB : No device found.
MS-7B23:8658:8658 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
MS-7B23:8658:8664 [0] NCCL INFO Setting affinity for GPU 0 to 3f
MS-7B23:8658:8664 [0] NCCL INFO Call to connect returned Connection refused, retrying
MS-7B23:8658:8664 [0] NCCL INFO Call to connect returned Connection refused, retrying

cbalioglu · June 14, 2021, 2:38pm

In your command line I don't see the --master_addr option, which is required for multi-node training. Do you mind retrying your job with the following?

# Node 0
python -m torch.distributed.launch --nnode=2 --node_rank=0 --nproc_per_node=1 --master_addr="<hostname_of_rank_0>" multi_gpu_practice/getting_start.py --local_world_size=1

# Node 1
python -m torch.distributed.launch --nnode=2 --node_rank=1 --nproc_per_node=1 --master_addr="<hostname_of_rank_0>" multi_gpu_practice/getting_start.py --local_world_size=1
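
Before rerunning, it may also help to confirm that node 1 can reach the master address and port at all, since "Connection refused" usually means nothing is listening at the address being dialed. The snippet below is only a rough sketch using the standard library; `can_reach` is a hypothetical helper name, and the address and port in the comment are the ones from the log above. It only gives a useful answer while the rank 0 process is already running and waiting.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError,
        # timeout) if the peer cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example, run on node 1 while node 0 is waiting:
#   can_reach("163.247.44.175", 7899)
```

If this returns False from node 1, the problem is likely in the network setup (firewall, wrong address, wrong interface) rather than in the training script.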

antae · June 15, 2021, 5:29am

Thanks for your reply, but the master address is set directly in the code.
I set NCCL_SOCKET_IFNAME="^eno1,enp0s31f6", but NCCL still uses only the lo interface.

gcramer23 · June 17, 2021, 6:53pm

Precision-7920-Tower:281611:281611 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
Precision-7920-Tower:281611:281611 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
Precision-7920-Tower:281611:281611 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.0

MS-7B23:8658:8658 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0>
MS-7B23:8658:8658 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
MS-7B23:8658:8658 [0] NCCL INFO NET/IB : No device found.
MS-7B23:8658:8658 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
MS-7B23:8658:8664 [0] NCCL INFO Setting affinity for GPU 0 to 3f
MS-7B23:8658:8664 [0] NCCL INFO Call to connect returned Connection refused, retrying
MS-7B23:8658:8664 [0] NCCL INFO Call to connect returned Connection refused, retrying

It looks like NCCL is having a problem establishing a connection. Can you verify that your interfaces are correct? The "Common environment variables" section of the torch.distributed documentation (Distributed communication package - torch.distributed) provides some information.
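
One way to check the interfaces is sketched below. As far as I understand the NCCL documentation, a leading "^" in NCCL_SOCKET_IFNAME excludes the listed interfaces, so a value like "^eno1,enp0s31f6" may leave only lo to choose from, which would match the "Using [0]lo:127.0.0.1" lines in both logs. The helper names here are hypothetical, and which interface actually connects the two nodes is an assumption you would need to confirm on each machine.

```python
import socket


def non_loopback_interfaces():
    """List interface names on this machine, skipping the loopback device."""
    return [name for _, name in socket.if_nameindex() if name != "lo"]


def distributed_env(ifname, master_addr, master_port):
    """Environment variables to set on every node before launching."""
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        # No leading "^": name the interface to use rather than exclude it.
        "NCCL_SOCKET_IFNAME": ifname,
        # Keeps the "NCCL INFO" lines visible so the chosen interface
        # can be checked in the log.
        "NCCL_DEBUG": "INFO",
    }
```

For example, calling `os.environ.update(distributed_env("eno1", "163.247.44.175", 7899))` before the process group is initialized would apply these settings, assuming eno1 is the interface that carries the 163.247.44.175 address on node 0. The interface name may differ on node 1, so it is worth printing `non_loopback_interfaces()` on both machines.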