Hi, I ran the tutorial code in a distributed session.
Node 0 has two RTX 6000 GPUs and node 1 has a single RTX 2080 Super.
Could the error be caused by the mismatch between the two nodes?
The error logs are below.
How can I fix this problem?
[281611] Initializing process group with: {'MASTER_ADDR': '163.247.44.175', 'MASTER_PORT': '7899', 'RANK': '0', 'WORLD_SIZE': '2'}
Precision-7920-Tower:281611:281611 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0>
Precision-7920-Tower:281611:281611 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
Precision-7920-Tower:281611:281611 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
Precision-7920-Tower:281611:281611 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
Precision-7920-Tower:281611:281611 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.0
[8658] world_size = 2, rank = 1, backend=nccl
[8658] rank = 1, world_size = 2, n = 1, device_ids = [0]
MS-7B23:8658:8658 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0>
MS-7B23:8658:8658 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
MS-7B23:8658:8658 [0] NCCL INFO NET/IB : No device found.
MS-7B23:8658:8658 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
MS-7B23:8658:8664 [0] NCCL INFO Setting affinity for GPU 0 to 3f
MS-7B23:8658:8664 [0] NCCL INFO Call to connect returned Connection refused, retrying
MS-7B23:8658:8664 [0] NCCL INFO Call to connect returned Connection refused, retrying
In your command line I don’t see the --master-addr option, which is required for multi-node training. Do you mind retrying your job with the following?
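One thing the logs above do show is that both hosts report `Bootstrap : Using [0]lo:127.0.0.1`, i.e. NCCL picked the loopback interface on each machine. Whether that is the cause of the `Connection refused` retries is not confirmed in this thread, so treat the following only as a diagnostic sketch. It scans NCCL log text and lists the hosts that bootstrapped over a 127.x.x.x address. The helper name `loopback_bootstrap_hosts` and the regular expression are illustrative assumptions, not part of PyTorch or NCCL.

```python
import re

# Matches NCCL bootstrap lines such as:
#   MS-7B23:8658:8658 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0>
# The pattern is an assumption based on the log format shown in this thread.
BOOTSTRAP_RE = re.compile(
    r"^(?P<host>[^:\s]+):\d+:\d+ \[\d+\] NCCL INFO Bootstrap : Using "
    r"\[\d+\](?P<iface>[^:\s]+):(?P<ip>[0-9.]+)<"
)


def loopback_bootstrap_hosts(log_text):
    """Return hostnames (in first-seen order) whose NCCL bootstrap used 127.x.x.x."""
    hosts = []
    for line in log_text.splitlines():
        match = BOOTSTRAP_RE.match(line.strip())
        if match is None:
            continue
        if match.group("ip").startswith("127.") and match.group("host") not in hosts:
            hosts.append(match.group("host"))
    return hosts


# Two bootstrap lines copied from the logs in this thread.
sample_log = """\
Precision-7920-Tower:281611:281611 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0>
MS-7B23:8658:8658 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0>
MS-7B23:8658:8664 [0] NCCL INFO Call to connect returned Connection refused, retrying
"""

print(loopback_bootstrap_hosts(sample_log))
```

If a host shows up in that list, one thing to try is setting the `NCCL_SOCKET_IFNAME` environment variable on that machine to the name of the network interface that actually connects the two nodes (the name is machine-specific; check it with `ip addr`) before the process group is initialized.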