torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8

Hi,
I have 4 RTX 3090 GPU cards, and each card works well on its own. Training also works when I set
export CUDA_VISIBLE_DEVICES="0,2" or export CUDA_VISIBLE_DEVICES="1,3".
But when I set export CUDA_VISIBLE_DEVICES="0,1,2,3" to use all four GPU cards, an NCCL error occurs.
My conda install command is: conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia (PyTorch 1.8 has the same problem).

The log output is below:
Traceback (most recent call last):
File "wenet/bin/train.py", line 197, in
model = torch.nn.parallel.DistributedDataParallel(
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 446, in __init__
dnn62:128319:128450 [0] NCCL INFO Channel 00 : 0[18000] -> 1[3b000] via P2P/IPC
self._sync_params_and_buffers(authoritative_rank=0)
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 457, in _sync_params_and_buffers
self._distributed_broadcast_coalesced(
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.

dnn62:128319:128450 [0] transport/p2p.cc:238 NCCL WARN failed to open CUDA IPC handle : 217 peer access is not supported between these two devices
dnn62:128319:128450 [0] NCCL INFO transport.cc:68 -> 1
dnn62:128319:128450 [0] NCCL INFO init.cc:766 -> 1
dnn62:128319:128450 [0] NCCL INFO init.cc:840 -> 1
dnn62:128319:128450 [0] NCCL INFO group.cc:73 -> 1 [Async thread]
Traceback (most recent call last):
File "wenet/bin/train.py", line 197, in
model = torch.nn.parallel.DistributedDataParallel(
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 446, in __init__
self._sync_params_and_buffers(authoritative_rank=0)
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 457, in _sync_params_and_buffers
self._distributed_broadcast_coalesced(
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced

dnn62:128322:128451 [0] transport/p2p.cc:276 NCCL WARN failed to open CUDA IPC handle : 217 peer access is not supported between these two devices
dnn62:128322:128451 [0] NCCL INFO transport.cc:79 -> 1
dnn62:128322:128451 [0] NCCL INFO init.cc:766 -> 1
dnn62:128322:128451 [0] NCCL INFO init.cc:840 -> 1
dnn62:128322:128451 [0] NCCL INFO group.cc:73 -> 1 [Async thread]
dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c1.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
Traceback (most recent call last):
File "wenet/bin/train.py", line 197, in
model = torch.nn.parallel.DistributedDataParallel(
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 446, in __init__

dnn62:128321:128452 [0] transport/p2p.cc:276 NCCL WARN failed to open CUDA IPC handle : 217 peer access is not supported between these two devices
dnn62:128321:128452 [0] NCCL INFO transport.cc:79 -> 1
dnn62:128321:128452 [0] NCCL INFO init.cc:766 -> 1
dnn62:128321:128452 [0] NCCL INFO init.cc:840 -> 1
dnn62:128321:128452 [0] NCCL INFO group.cc:73 -> 1 [Async thread]
self._sync_params_and_buffers(authoritative_rank=0)
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 457, in _sync_params_and_buffers
self._distributed_broadcast_coalesced(
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
Traceback (most recent call last):
File "wenet/bin/train.py", line 197, in
model = torch.nn.parallel.DistributedDataParallel(
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 446, in __init__
dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
self._sync_params_and_buffers(authoritative_rank=0)
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 457, in _sync_params_and_buffers
self._distributed_broadcast_coalesced(
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.

Hello,
I am facing exactly the same problem on a server with 4 RTX 3090 GPUs, torch 1.8.1, and CUDA 11.1.

Has anyone found a solution, or is this being discussed further somewhere?

Thanks in advance.

The error points to an unsupported P2P setup. You can disable P2P via NCCL_P2P_DISABLE=1, update to the latest PyTorch release and check whether you still run into this error, or do both.
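
To see which device pairs report no peer access, a minimal sketch along these lines may help. It assumes a PyTorch build that provides torch.cuda.can_device_access_peer; the helper names unsupported_peer_pairs and report_peer_access are hypothetical and not part of PyTorch or NCCL.

```python
from itertools import permutations


def unsupported_peer_pairs(num_devices, can_access):
    """Return ordered (src, dst) device pairs for which can_access(src, dst) is False.

    can_access is any callable taking two device indices and returning a bool.
    """
    return [
        (src, dst)
        for src, dst in permutations(range(num_devices), 2)
        if not can_access(src, dst)
    ]


def report_peer_access():
    """Print the visible GPU pairs that report no peer (P2P) access."""
    import torch  # imported here so the helper above stays usable without torch

    num_devices = torch.cuda.device_count()
    pairs = unsupported_peer_pairs(num_devices, torch.cuda.can_device_access_peer)
    if pairs:
        for src, dst in pairs:
            print(f"GPU {src} -> GPU {dst}: peer access not supported")
    else:
        print("All visible GPU pairs report peer access.")
```

Calling report_peer_access() on the affected machine prints the pairs. The indices refer to the devices that remain visible after CUDA_VISIBLE_DEVICES is applied, so they may differ from the physical numbering.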

Setting NCCL_P2P_DISABLE=1 solves the problem.
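
For anyone who prefers to set this from inside the training script rather than with export NCCL_P2P_DISABLE=1 in the shell, here is a minimal sketch. The helper disable_nccl_p2p is hypothetical, and the assumption is that it runs before the process group is initialized, since NCCL reads the variable when it sets up its communicators.

```python
import os


def disable_nccl_p2p(env=os.environ):
    """Set NCCL_P2P_DISABLE=1 in the given environment mapping and return it."""
    env["NCCL_P2P_DISABLE"] = "1"
    return env


# Assumption: call this at the very top of the training script, before
# torch.distributed.init_process_group(...) and before wrapping the model
# in DistributedDataParallel.
disable_nccl_p2p()
```

With P2P disabled, NCCL falls back to other transports such as shared memory, which may be slower than direct GPU-to-GPU copies.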

Thank you very much!