Hello, this is the log I got.
mew1:387101:387101 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno8303
mew1:387101:387101 [0] NCCL INFO Bootstrap : Using eno8303:192.168.0.4<0>
mew1:387101:387101 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
mew1:387101:387101 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
mew1:387101:387101 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.17.1+cuda11.7
mew1:387101:387152 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno8303
mew1:387101:387152 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB eno8303:192.168.0.4<0>
mew1:387101:387152 [0] NCCL INFO Using network IB
mew1:387101:387151 [0] bootstrap.cc:126 NCCL WARN Bootstrap Root : mismatch in rank count from procs 4 : 6
mew1:387102:387102 [1] NCCL INFO cudaDriverVersion 12000
mew1:387102:387102 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno8303
mew1:387102:387102 [1] NCCL INFO Bootstrap : Using eno8303:192.168.0.4<0>
mew1:387102:387102 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
mew1:387102:387102 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
mew1:387102:387160 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno8303
mew1:387103:387103 [2] NCCL INFO cudaDriverVersion 12000
mew1:387103:387103 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno8303
mew1:387103:387103 [2] NCCL INFO Bootstrap : Using eno8303:192.168.0.4<0>
mew1:387103:387103 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
mew1:387103:387103 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation
mew1:387104:387104 [3] NCCL INFO cudaDriverVersion 12000
mew1:387104:387104 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno8303
mew1:387104:387104 [3] NCCL INFO Bootstrap : Using eno8303:192.168.0.4<0>
mew1:387104:387104 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
mew1:387104:387104 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation
mew1:387103:387161 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno8303
mew1:387104:387162 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno8303
mew1:387103:387161 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB eno8303:192.168.0.4<0>
mew1:387103:387161 [2] NCCL INFO Using network IB
mew1:387103:387161 [2] misc/socket.cc:480 NCCL WARN socketStartConnect: Connect to 192.168.0.4<35443> failed : Software caused connection abort
mew1:387103:387161 [2] NCCL INFO misc/socket.cc:561 -> 2
mew1:387103:387161 [2] NCCL INFO misc/socket.cc:615 -> 2
mew1:387103:387161 [2] NCCL INFO bootstrap.cc:270 -> 2
mew1:387103:387161 [2] NCCL INFO init.cc:630 -> 2
mew1:387103:387161 [2] NCCL INFO init.cc:1114 -> 2
mew1:387103:387161 [2] NCCL INFO group.cc:64 -> 2 [Async thread]
mew1:387103:387103 [2] NCCL INFO group.cc:422 -> 2
mew1:387103:387103 [2] NCCL INFO group.cc:106 -> 2
mew1:387103:387103 [0] NCCL INFO comm 0x5fb6c6e0 rank 2 nranks 4 cudaDev 2 busId ca000 - Abort COMPLETE
I would appreciate any help with this. Thanks a lot!