The error message is as follows.
The NCCL debug information is as follows.
autodl-container-c1cd47b9b8-7085d22d:4018:4018 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.11<0>
autodl-container-c1cd47b9b8-7085d22d:4018:4018 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
autodl-container-c1cd47b9b8-7085d22d:4018:4018 [0] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
autodl-container-c1cd47b9b8-7085d22d:4018:4018 [0] transport/net_ib.cc:149 NCCL WARN NET/IB : Unable to open device mlx5_0
autodl-container-c1cd47b9b8-7085d22d:4018:4018 [0] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
autodl-container-c1cd47b9b8-7085d22d:4018:4018 [0] transport/net_ib.cc:149 NCCL WARN NET/IB : Unable to open device mlx5_1
autodl-container-c1cd47b9b8-7085d22d:4018:4018 [0] NCCL INFO NET/IB : No device found.
autodl-container-c1cd47b9b8-7085d22d:4018:4018 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.11<0>
autodl-container-c1cd47b9b8-7085d22d:4018:4018 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
autodl-container-c1cd47b9b8-7085d22d:4020:4020 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.11<0>
autodl-container-c1cd47b9b8-7085d22d:4020:4020 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
autodl-container-c1cd47b9b8-7085d22d:4020:4020 [0] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
autodl-container-c1cd47b9b8-7085d22d:4020:4020 [0] transport/net_ib.cc:149 NCCL WARN NET/IB : Unable to open device mlx5_0
autodl-container-c1cd47b9b8-7085d22d:4020:4020 [0] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
autodl-container-c1cd47b9b8-7085d22d:4020:4020 [0] transport/net_ib.cc:149 NCCL WARN NET/IB : Unable to open device mlx5_1
autodl-container-c1cd47b9b8-7085d22d:4020:4020 [0] NCCL INFO NET/IB : No device found.
autodl-container-c1cd47b9b8-7085d22d:4020:4020 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.11<0>
autodl-container-c1cd47b9b8-7085d22d:4020:4020 [0] NCCL INFO Using network Socket
autodl-container-c1cd47b9b8-7085d22d:4019:4019 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.11<0>
autodl-container-c1cd47b9b8-7085d22d:4019:4019 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
autodl-container-c1cd47b9b8-7085d22d:4019:4019 [1] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
autodl-container-c1cd47b9b8-7085d22d:4019:4019 [1] transport/net_ib.cc:149 NCCL WARN NET/IB : Unable to open device mlx5_0
autodl-container-c1cd47b9b8-7085d22d:4019:4019 [1] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
autodl-container-c1cd47b9b8-7085d22d:4019:4019 [1] transport/net_ib.cc:149 NCCL WARN NET/IB : Unable to open device mlx5_1
autodl-container-c1cd47b9b8-7085d22d:4019:4019 [1] NCCL INFO NET/IB : No device found.
autodl-container-c1cd47b9b8-7085d22d:4019:4019 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.11<0>
autodl-container-c1cd47b9b8-7085d22d:4019:4019 [1] NCCL INFO Using network Socket
autodl-container-c1cd47b9b8-7085d22d:4021:4021 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.11<0>
autodl-container-c1cd47b9b8-7085d22d:4021:4021 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
autodl-container-c1cd47b9b8-7085d22d:4021:4021 [1] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
autodl-container-c1cd47b9b8-7085d22d:4021:4021 [1] transport/net_ib.cc:149 NCCL WARN NET/IB : Unable to open device mlx5_0
autodl-container-c1cd47b9b8-7085d22d:4021:4021 [1] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
autodl-container-c1cd47b9b8-7085d22d:4021:4021 [1] transport/net_ib.cc:149 NCCL WARN NET/IB : Unable to open device mlx5_1
autodl-container-c1cd47b9b8-7085d22d:4021:4021 [1] NCCL INFO NET/IB : No device found.
autodl-container-c1cd47b9b8-7085d22d:4021:4021 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.11<0>
autodl-container-c1cd47b9b8-7085d22d:4021:4021 [1] NCCL INFO Using network Socket
autodl-container-c1cd47b9b8-7085d22d:4018:4348 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 0 and rank 2 both on CUDA device b1000
autodl-container-c1cd47b9b8-7085d22d:4020:4349 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 2 and rank 0 both on CUDA device b1000
autodl-container-c1cd47b9b8-7085d22d:4019:4350 [1] init.cc:521 NCCL WARN Duplicate GPU detected : rank 1 and rank 3 both on CUDA device b2000
autodl-container-c1cd47b9b8-7085d22d:4018:4348 [0] NCCL INFO init.cc:904 -> 5
autodl-container-c1cd47b9b8-7085d22d:4021:4351 [1] init.cc:521 NCCL WARN Duplicate GPU detected : rank 3 and rank 1 both on CUDA device b2000
autodl-container-c1cd47b9b8-7085d22d:4020:4349 [0] NCCL INFO init.cc:904 -> 5
autodl-container-c1cd47b9b8-7085d22d:4019:4350 [1] NCCL INFO init.cc:904 -> 5
autodl-container-c1cd47b9b8-7085d22d:4018:4348 [0] NCCL INFO group.cc:72 -> 5 [Async thread]
autodl-container-c1cd47b9b8-7085d22d:4021:4351 [1] NCCL INFO init.cc:904 -> 5
autodl-container-c1cd47b9b8-7085d22d:4020:4349 [0] NCCL INFO group.cc:72 -> 5 [Async thread]
autodl-container-c1cd47b9b8-7085d22d:4019:4350 [1] NCCL INFO group.cc:72 -> 5 [Async thread]
autodl-container-c1cd47b9b8-7085d22d:4021:4351 [1] NCCL INFO group.cc:72 -> 5 [Async thread]
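
Reading the log, four ranks (0 to 3) seem to be running on a machine that exposes only two CUDA devices (b1000 and b2000), so each device is claimed by two ranks. To make that easier to see, here is a minimal sketch that groups the "Duplicate GPU detected" warnings by device. It was not part of the original run; the helper name `ranks_per_device` and the `SAMPLE` text are hypothetical, and the pattern assumes the warning format shown above.

```python
import re
from collections import defaultdict

# Matches NCCL warnings of the form seen in the log above (assumed format).
DUP_RE = re.compile(
    r"Duplicate GPU detected : rank (\d+) and rank (\d+) both on CUDA device (\w+)"
)


def ranks_per_device(log_text):
    """Map each CUDA device ID to the sorted list of ranks reported on it."""
    devices = defaultdict(set)
    for match in DUP_RE.finditer(log_text):
        rank_a, rank_b, device = match.groups()
        devices[device].update((int(rank_a), int(rank_b)))
    return {device: sorted(ranks) for device, ranks in devices.items()}


# Two warning lines copied from the log, with the hostname prefix trimmed.
SAMPLE = """\
[0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 0 and rank 2 both on CUDA device b1000
[1] init.cc:521 NCCL WARN Duplicate GPU detected : rank 1 and rank 3 both on CUDA device b2000
"""

print(ranks_per_device(SAMPLE))
```

If the result lists more than one rank for a device, the number of launched processes is probably larger than the number of visible GPUs.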