RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370128159/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8

Hello, I am getting the error above. This is the log after setting export NCCL_DEBUG=INFO:

1dfff1d89025:310:310 [0] NCCL INFO Bootstrap : Using [0]eth0:172.17.0.2<0>
1dfff1d89025:310:310 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

1dfff1d89025:310:310 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
1dfff1d89025:310:310 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
1dfff1d89025:310:310 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2

1dfff1d89025:310:337 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:01/…/…/0000:01:00.0
1dfff1d89025:310:337 [0] NCCL INFO graph/xml.cc:469 -> 2
1dfff1d89025:310:337 [0] NCCL INFO graph/xml.cc:660 -> 2
1dfff1d89025:310:337 [0] NCCL INFO graph/topo.cc:523 -> 2
1dfff1d89025:310:337 [0] NCCL INFO init.cc:581 -> 2
1dfff1d89025:310:337 [0] NCCL INFO init.cc:840 -> 2
1dfff1d89025:310:337 [0] NCCL INFO group.cc:73 -> 2 [Async thread]
Traceback (most recent call last):
File "/home/RNNPose/tools/eval.py", line 579, in
fire.Fire()
File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
target=component.name)
File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/RNNPose/tools/eval.py", line 227, in multi_proc_train
args=( params,) )
File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/RNNPose/tools/eval.py", line 255, in train_worker
apex_opt_level=params.apex_opt_level
File "/home/RNNPose/tools/eval.py", line 317, in eval
backend="nccl", init_method=dist_url, world_size=get_world(use_dist), rank=get_rank(use_dist))
File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
barrier()
File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370128159/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8

I am running the program inside a Docker container.
Python: 3.7.10
Torch: 1.7.1
CUDA: 10.2

Solutions tried so far:

  1. Using --ipc=host
  2. export NCCL_IB_DISABLE=1
  3. torch.cuda.set_device(rank)

Any help is appreciated. Let me know if more details are needed. Thanks in advance.

Thanks for reporting. The following warning seems to be the fatal one:
NCCL WARN Could not find real path of /sys/class/pci_bus
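
In a longer NCCL_DEBUG=INFO log, warnings like this one are easy to miss among the INFO lines. Below is a minimal sketch of a filter that pulls out only the NCCL WARN lines. The regular expression is an assumption based on the line layout seen in the log above (host:pid:tid [device] source-file:line NCCL WARN message), and `nccl_warnings` is a hypothetical helper name, not part of NCCL or PyTorch.

```python
import re

# Assumed layout, taken from the log lines in this thread:
#   <host>:<pid>:<tid> [<dev>] <file>:<line> NCCL WARN <message>
WARN_RE = re.compile(
    r"^\S+:\d+:\d+ \[\d+\] (?P<src>\S+:\d+) NCCL WARN (?P<msg>.*)$"
)


def nccl_warnings(log_text):
    """Return (source_location, message) pairs for every NCCL WARN line."""
    found = []
    for line in log_text.splitlines():
        match = WARN_RE.match(line.strip())
        if match:
            found.append((match.group("src"), match.group("msg")))
    return found


# Sample lines copied from the log in this thread.
sample = """\
1dfff1d89025:310:310 [0] NCCL INFO Using network Socket
1dfff1d89025:310:310 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
1dfff1d89025:310:337 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus
"""

for src, msg in nccl_warnings(sample):
    print(src, "|", msg)
```

The filter lists every warning and does not decide which one is fatal. In this log, the libibverbs warning is followed by a normal fallback to the Socket network, while the pci_bus warning is followed directly by the chain of "-> 2" error returns.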

This has been reported by other users on NCCL 2.7 before. NCCL experts suggested upgrading the NCCL version to work around the topology detection failure; see "NCCL WARN Could not find real path of..." (Issue #573, NVIDIA/nccl on GitHub).
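
To confirm which NCCL version a given PyTorch build actually uses, something like the following sketch may help. It assumes `torch.cuda.nccl.version()` is available and returns either an integer code (for example 2708 for 2.7.8, as in older PyTorch releases) or a tuple (as in newer ones). The helper names `nccl_version_tuple`, `is_nccl_2_7`, and `report_installed_nccl` are hypothetical, introduced only for illustration.

```python
def nccl_version_tuple(version):
    """Normalize an NCCL version (int code or tuple) to (major, minor, patch)."""
    if isinstance(version, tuple):
        return tuple(version[:3])
    code = int(version)
    # NCCL encodes versions up to 2.8.x as major*1000 + minor*100 + patch,
    # and later versions as major*10000 + minor*100 + patch.
    if code < 10000:
        major, rest = divmod(code, 1000)
    else:
        major, rest = divmod(code, 10000)
    minor, patch = divmod(rest, 100)
    return (major, minor, patch)


def is_nccl_2_7(version):
    """True if the version belongs to the 2.7 series discussed in this thread."""
    return nccl_version_tuple(version)[:2] == (2, 7)


def report_installed_nccl():
    """Print the NCCL version bundled with the local PyTorch, if it can be read."""
    try:
        import torch
        version = torch.cuda.nccl.version()
    except Exception as exc:  # torch missing, or a build without NCCL support
        print("Could not query NCCL version:", exc)
        return
    print("NCCL version:", nccl_version_tuple(version))
    if is_nccl_2_7(version):
        print("This is the 2.7 series; upgrading NCCL was the suggested workaround.")


if __name__ == "__main__":
    report_installed_nccl()
```

With the PyTorch binaries, NCCL is typically bundled with the build, so in practice upgrading NCCL usually means moving to a PyTorch release that ships a newer NCCL.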