torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8

Hi,
I have four RTX 3090 GPUs. Each GPU works fine on its own, and training also works when I set
export CUDA_VISIBLE_DEVICES="0,2" or export CUDA_VISIBLE_DEVICES="1,3".
But when I set export CUDA_VISIBLE_DEVICES="0,1,2,3" to use all four GPUs, an NCCL error occurs.
My conda install command is: conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia (PyTorch 1.8 also has the same problem).

The log output is below (the output of the four worker processes is interleaved):
Traceback (most recent call last):
File "wenet/bin/train.py", line 197, in <module>
model = torch.nn.parallel.DistributedDataParallel(
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 446, in __init__
dnn62:128319:128450 [0] NCCL INFO Channel 00 : 0[18000] -> 1[3b000] via P2P/IPC
self._sync_params_and_buffers(authoritative_rank=0)
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 457, in _sync_params_and_buffers
self._distributed_broadcast_coalesced(
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.

dnn62:128319:128450 [0] transport/p2p.cc:238 NCCL WARN failed to open CUDA IPC handle : 217 peer access is not supported between these two devices
dnn62:128319:128450 [0] NCCL INFO transport.cc:68 -> 1
dnn62:128319:128450 [0] NCCL INFO init.cc:766 -> 1
dnn62:128319:128450 [0] NCCL INFO init.cc:840 -> 1
dnn62:128319:128450 [0] NCCL INFO group.cc:73 -> 1 [Async thread]
Traceback (most recent call last):
File "wenet/bin/train.py", line 197, in <module>
model = torch.nn.parallel.DistributedDataParallel(
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 446, in __init__
self._sync_params_and_buffers(authoritative_rank=0)
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 457, in _sync_params_and_buffers
self._distributed_broadcast_coalesced(
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced

dnn62:128322:128451 [0] transport/p2p.cc:276 NCCL WARN failed to open CUDA IPC handle : 217 peer access is not supported between these two devices
dnn62:128322:128451 [0] NCCL INFO transport.cc:79 -> 1
dnn62:128322:128451 [0] NCCL INFO init.cc:766 -> 1
dnn62:128322:128451 [0] NCCL INFO init.cc:840 -> 1
dnn62:128322:128451 [0] NCCL INFO group.cc:73 -> 1 [Async thread]
dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c1.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
Traceback (most recent call last):
File "wenet/bin/train.py", line 197, in <module>
model = torch.nn.parallel.DistributedDataParallel(
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 446, in __init__

dnn62:128321:128452 [0] transport/p2p.cc:276 NCCL WARN failed to open CUDA IPC handle : 217 peer access is not supported between these two devices
dnn62:128321:128452 [0] NCCL INFO transport.cc:79 -> 1
dnn62:128321:128452 [0] NCCL INFO init.cc:766 -> 1
dnn62:128321:128452 [0] NCCL INFO init.cc:840 -> 1
dnn62:128321:128452 [0] NCCL INFO group.cc:73 -> 1 [Async thread]
self._sync_params_and_buffers(authoritative_rank=0)
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 457, in _sync_params_and_buffers
self._distributed_broadcast_coalesced(
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
Traceback (most recent call last):
File "wenet/bin/train.py", line 197, in <module>
model = torch.nn.parallel.DistributedDataParallel(
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 446, in __init__
dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
self._sync_params_and_buffers(authoritative_rank=0)
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 457, in _sync_params_and_buffers
self._distributed_broadcast_coalesced(
File "/sunj/yanlong/miniconda3/envs/3090/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
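
The warning "peer access is not supported between these two devices" in the log suggests that some GPU pairs cannot reach each other over P2P, which would fit the observation that "0,2" and "1,3" work while "0,1,2,3" does not. Below is a minimal sketch for listing the affected pairs. It assumes the installed PyTorch provides torch.cuda.can_device_access_peer; the helper name peer_access_report is hypothetical and is not part of PyTorch or wenet.

```python
from itertools import permutations


def peer_access_report(num_devices, can_access):
    """Return every ordered (src, dst) device pair for which can_access is False.

    can_access is any callable taking two device indices; on a real machine it
    would be torch.cuda.can_device_access_peer (assumed to be available).
    """
    return [
        (src, dst)
        for src, dst in permutations(range(num_devices), 2)
        if not can_access(src, dst)
    ]


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; cannot query the GPUs.")
    else:
        n = torch.cuda.device_count()
        blocked = peer_access_report(n, torch.cuda.can_device_access_peer)
        if blocked:
            for src, dst in blocked:
                print(f"GPU {src} cannot access GPU {dst} via P2P")
        else:
            print(f"No blocked P2P pairs among {n} visible GPU(s)")
```

Run it with all four cards visible (CUDA_VISIBLE_DEVICES="0,1,2,3"). If it reports blocked pairs, NCCL also documents an NCCL_P2P_DISABLE environment variable; setting it to 1 before launching training might be worth trying, though I have not confirmed that it resolves this particular error.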