I am facing an NCCL error while trying to run my Python training script:
Init multi-processing training...
d13186ffee3a:57:57 [0] NCCL INFO Bootstrap : Using [0]eth0:172.17.0.2<0>
d13186ffee3a:57:57 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
d13186ffee3a:57:57 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
d13186ffee3a:57:57 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
NCCL version 2.4.8+cuda10.1
d13186ffee3a:57:75 [0] misc/topo.cc:22 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:02/../../0000:02:00.0
d13186ffee3a:57:75 [0] NCCL INFO init.cc:876 -> 2
d13186ffee3a:57:75 [0] NCCL INFO init.cc:909 -> 2
d13186ffee3a:57:75 [0] NCCL INFO init.cc:947 -> 2
d13186ffee3a:57:75 [0] NCCL INFO misc/group.cc:69 -> 2 [Async thread]
Traceback (most recent call last):
  File "train.py", line 174, in <module>
    run_train()
  File "train.py", line 171, in run_train
    multi_train(args, config, Network)
  File "train.py", line 155, in multi_train
    torch.multiprocessing.spawn(train_worker, nprocs=num_gpus, args=(train_config, network, config))
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/crowddet/tools/train.py", line 100, in train_worker
    net = torch.nn.parallel.DistributedDataParallel(net, device_ids=[rank], broadcast_buffers=False)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 285, in __init__
    self.broadcast_bucket_size)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 483, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1587428398394/work/torch/lib/c10d/ProcessGroupNCCL.cpp:514, unhandled system error, NCCL version 2.4.8
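
For context, the failing call reduces to a standard spawn-plus-DDP setup. Below is a minimal sketch of what happens at the failing line (train_worker, num_gpus, and the DistributedDataParallel arguments match the traceback; the MASTER_ADDR/MASTER_PORT values and the toy model are placeholders I added just for this sketch):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn

def train_worker(rank, num_gpus):
    # Each spawned process joins the NCCL process group.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # placeholder rendezvous address
    os.environ.setdefault("MASTER_PORT", "29500")      # placeholder port
    dist.init_process_group("nccl", rank=rank, world_size=num_gpus)
    torch.cuda.set_device(rank)
    net = nn.Linear(8, 8).cuda(rank)  # toy stand-in for my actual Network
    # This is the call that dies with the NCCL error: DDP broadcasts the
    # initial parameters/buffers across ranks during __init__.
    net = nn.parallel.DistributedDataParallel(net, device_ids=[rank], broadcast_buffers=False)
    dist.destroy_process_group()

if __name__ == "__main__":
    num_gpus = torch.cuda.device_count()
    mp.spawn(train_worker, nprocs=num_gpus, args=(num_gpus,))
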
env:
CUDNN_VERSION=7.6.5.32
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
LESSCLOSE=/usr/bin/lesspipe %s %s
NVIDIA_VISIBLE_DEVICES=all
NCCL_VERSION=2.4.8
PWD=/crowddet
HOME=/root
CMAKE_PREFIX_PATH=$(dirname $(which conda))/../
LIBRARY_PATH=/usr/local/cuda/lib64/stubs
TERM=xterm
TORCH_CUDA_ARCH_LIST=6.0 6.1 7.0+PTX
CUDA_PKG_VERSION=10-1=10.1.243-1
CUDA_VERSION=10.1.243
NVIDIA_DRIVER_CAPABILITIES=compute,utility
SHLVL=1
NVIDIA_REQUIRE_CUDA=cuda>=10.1 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=396,driver<397 brand=tesla,driver>=410,driver<411
TORCH_NVCC_FLAGS=-Xfatbin -compress-all
PATH=/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
LESSOPEN=| /usr/bin/lesspipe %s
_=/usr/bin/env
I have tried the solutions suggested for many similar NCCL errors, but I am still unable to solve this one.
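
For reference, the fixes I tried are variations of the standard NCCL environment-variable knobs, along the lines of this sketch (the variable names are documented NCCL settings; the exact combinations I tried varied):

import os

# Set before torch.distributed creates the NCCL process group:
os.environ["NCCL_DEBUG"] = "INFO"          # verbose NCCL logging to pinpoint the failing step
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # pin bootstrap to the interface NCCL already reports
os.environ["NCCL_IB_DISABLE"] = "1"        # skip InfiniBand (libibverbs.so is missing here anyway)
os.environ["NCCL_P2P_DISABLE"] = "1"       # fall back from GPU peer-to-peer transport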