NCCL error in PyTorch 2.1.0 when using multiple GPUs

Hi,

I encountered an NCCL error when using PyTorch 2.1.0 with multiple GPUs. When I downgraded PyTorch to 2.0.1, the error disappeared. Here is a minimal example. First, enable NCCL logging in the shell:

export NCCL_DEBUG=INFO

Then run the following Python script:

import torch

torch.cuda.nccl.all_gather([torch.zeros(5).cuda()], [torch.zeros(5).cuda()])
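
For what it's worth, the same NCCL communicator initialization can also be reached through the higher-level torch.distributed API. A minimal sketch (a single-process "world" of size 1 using the standard init_process_group/all_gather calls; the address and port are placeholder values for the local rendezvous):

import os
import torch
import torch.distributed as dist

# Single-process rendezvous; the address/port are only used locally.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="nccl", rank=0, world_size=1)

x = torch.zeros(5, device="cuda")
out = [torch.empty_like(x)]
dist.all_gather(out, x)  # the first collective triggers NCCL communicator init
dist.destroy_process_group()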

Error Message

JMUSE-AS-2124GQ-NART:2311505:2311505 [0] NCCL INFO cudaDriverVersion 12000
JMUSE-AS-2124GQ-NART:2311505:2311505 [0] NCCL INFO Bootstrap : Using enxb03af2b6059f:169.254.3.1<0>
JMUSE-AS-2124GQ-NART:2311505:2311505 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
JMUSE-AS-2124GQ-NART:2311505:2311505 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
NCCL version 2.18.5+cuda11.8
JMUSE-AS-2124GQ-NART:2311505:2311519 [0] NCCL INFO NET/IB : No device found.
JMUSE-AS-2124GQ-NART:2311505:2311519 [0] NCCL INFO NET/Socket : Using [0]enxb03af2b6059f:169.254.3.1<0> [1]enp97s0f0:192.168.187.57<0> [2]tun0:10.8.0.1<0> [3]br-cf914d50c295:172.18.0.1<0> [4]br-0eebd337ccf4:172.19.0.1<0> [5]tun1:198.18.192.11<0> [6]veth232c90d:fe80::985b:78ff:feeb:62d9%veth232c90d<0> [7]vethb4455b0:fe80::dc02:2dff:fe34:ab7f%vethb4455b0<0> [8]veth83972a3:fe80::7056:a7ff:fec4:94a5%veth83972a3<0> [9]veth8319398:fe80::440c:daff:fe3c:354c%veth8319398<0> [10]veth70b5a2c:fe80::1ccc:70ff:feb2:af31%veth70b5a2c<0> [11]veth67387c5:fe80::7c78:6fff:fe57:c5bf%veth67387c5<0> [12]veth7be2777:fe80::8408:f5ff:fe93:685d%veth7be2777<0> [13]vethb8dc075:fe80::9405:e4ff:fe22:dfa2%vethb8dc075<0>
JMUSE-AS-2124GQ-NART:2311505:2311519 [0] NCCL INFO Using network Socket
JMUSE-AS-2124GQ-NART:2311505:2311519 [0] NCCL INFO comm 0x560e80708b00 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 1000 commId 0x12a81fddad48cd92 - Init START

JMUSE-AS-2124GQ-NART:2311505:2311519 [0] graph/xml.h:85 NCCL WARN Attribute busid of node nic not found
JMUSE-AS-2124GQ-NART:2311505:2311519 [0] NCCL INFO graph/xml.cc:585 -> 3
JMUSE-AS-2124GQ-NART:2311505:2311519 [0] NCCL INFO graph/xml.cc:767 -> 3
JMUSE-AS-2124GQ-NART:2311505:2311519 [0] NCCL INFO graph/topo.cc:655 -> 3
JMUSE-AS-2124GQ-NART:2311505:2311519 [0] NCCL INFO init.cc:840 -> 3
JMUSE-AS-2124GQ-NART:2311505:2311519 [0] NCCL INFO init.cc:1358 -> 3
JMUSE-AS-2124GQ-NART:2311505:2311519 [0] NCCL INFO group.cc:65 -> 3 [Async thread]
JMUSE-AS-2124GQ-NART:2311505:2311505 [0] NCCL INFO group.cc:406 -> 3
JMUSE-AS-2124GQ-NART:2311505:2311505 [0] NCCL INFO group.cc:96 -> 3
JMUSE-AS-2124GQ-NART:2311505:2311505 [0] NCCL INFO init.cc:1691 -> 3
Traceback (most recent call last):
  File "/mnt/ramdisk0/test.py", line 2, in <module>
    torch.cuda.nccl.all_gather([torch.zeros(5).cuda()], [torch.zeros(5).cuda()])
  File "/home/ping/mambaforge/envs/pytorch-nccl/lib/python3.11/site-packages/torch/cuda/nccl.py", line 121, in all_gather
    torch._C._nccl_all_gather(inputs, outputs, streams, comms)
RuntimeError: NCCL Error 3: internal error - please report this issue to the NCCL developers
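
The "NCCL version 2.18.5+cuda11.8" line above is the NCCL build bundled with the PyTorch wheel, so the two PyTorch versions also ship different NCCL versions. A quick way to compare the bundled NCCL across the two environments (torch.cuda.nccl.version() is part of the public API):

import torch

print(torch.__version__)          # 2.1.0 in the failing env, 2.0.1 in the working one
print(torch.cuda.nccl.version())  # bundled NCCL version, e.g. (2, 18, 5) here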

For the detailed environment, please see pytorch/pytorch issue #113245 on GitHub ("NCCL error of PyTorch 2.1.0 when using multiple gpus").
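
Since the failure originates in NCCL's topology detection (the graph/xml.cc frames above), narrowing the debug output to that subsystem may give more context for triage. NCCL_DEBUG_SUBSYS is a standard NCCL environment variable, and setting it from Python works as long as it happens before the first collective initializes the communicator:

import os

# Must be set before NCCL communicator init reads the environment.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "GRAPH"  # focus logging on topology detection

import torch
torch.cuda.nccl.all_gather([torch.zeros(5).cuda()], [torch.zeros(5).cuda()])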