Hello, I am getting the above error. This is the log after setting export NCCL_DEBUG=INFO:
1dfff1d89025:310:310 [0] NCCL INFO Bootstrap : Using [0]eth0:172.17.0.2<0>
1dfff1d89025:310:310 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
1dfff1d89025:310:310 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
1dfff1d89025:310:310 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
1dfff1d89025:310:310 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
1dfff1d89025:310:337 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:01/…/…/0000:01:00.0
1dfff1d89025:310:337 [0] NCCL INFO graph/xml.cc:469 -> 2
1dfff1d89025:310:337 [0] NCCL INFO graph/xml.cc:660 -> 2
1dfff1d89025:310:337 [0] NCCL INFO graph/topo.cc:523 -> 2
1dfff1d89025:310:337 [0] NCCL INFO init.cc:581 -> 2
1dfff1d89025:310:337 [0] NCCL INFO init.cc:840 -> 2
1dfff1d89025:310:337 [0] NCCL INFO group.cc:73 -> 2 [Async thread]
Traceback (most recent call last):
  File "/home/RNNPose/tools/eval.py", line 579, in <module>
    fire.Fire()
  File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/RNNPose/tools/eval.py", line 227, in multi_proc_train
    args=( params,) )
  File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/RNNPose/tools/eval.py", line 255, in train_worker
    apex_opt_level=params.apex_opt_level
  File "/home/RNNPose/tools/eval.py", line 317, in eval
    backend="nccl", init_method=dist_url, world_size=get_world(use_dist), rank=get_rank(use_dist))
  File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370128159/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
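In case it helps when reading the log, here is a rough sketch for pulling only the NCCL WARN lines out of the NCCL_DEBUG=INFO output. The function name nccl_warnings and the regular expression are my own and assume the line layout seen in the log above (host:pid:tid [device] file:line NCCL WARN message); other NCCL versions may print a different layout.

```python
import re

# Assumed layout, taken from the log above:
#   <host>:<pid>:<tid> [<device>] <file>:<line> NCCL WARN <message>
WARN_RE = re.compile(
    r"^(?P<host>\S+):(?P<pid>\d+):(?P<tid>\d+) "
    r"\[(?P<dev>\d+)\] (?P<loc>\S+) NCCL WARN (?P<msg>.*)$"
)


def nccl_warnings(log_text):
    """Return (source_location, message) pairs for every NCCL WARN line."""
    found = []
    for line in log_text.splitlines():
        match = WARN_RE.match(line.strip())
        if match:
            found.append((match.group("loc"), match.group("msg")))
    return found


# Two lines copied from the log above: one WARN, one INFO.
SAMPLE = (
    "1dfff1d89025:310:310 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]\n"
    "1dfff1d89025:310:310 [0] NCCL INFO Using network Socket\n"
)

if __name__ == "__main__":
    for loc, msg in nccl_warnings(SAMPLE):
        print(loc, "|", msg)
```

Run on the full log, this should leave only the libibverbs warning and the graph/xml.cc "Could not find real path" warning. The second one is the last warning printed before init fails.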
I am running the program inside a Docker container.
Python: 3.7.10
Torch: 1.7.1
CUDA: 10.2
Solutions tried so far:
- using --ipc=host
- export NCCL_IB_DISABLE=1
- torch.cuda.set_device(rank)
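For reference, the container-level settings from the list above can be written out as one command line. This is only a sketch, not the exact invocation: the image name rnnpose:latest is a placeholder, the --gpus all flag is an assumption about how the GPUs are exposed to the container, and eval.py is shown without its arguments. The helper docker_run_command is hypothetical as well.

```python
import shlex

# Environment variables mentioned in this post.
NCCL_ENV = {"NCCL_DEBUG": "INFO", "NCCL_IB_DISABLE": "1"}


def docker_run_command(image, script, env=NCCL_ENV):
    """Build a `docker run` argument list with the settings tried so far."""
    # --gpus all is an assumption; --ipc=host is the flag from the list above.
    cmd = ["docker", "run", "--rm", "--gpus", "all", "--ipc=host"]
    for key in sorted(env):
        cmd += ["-e", "{}={}".format(key, env[key])]
    cmd += [image, "python", script]
    return cmd


if __name__ == "__main__":
    # Placeholder image name; the script path is the one from the traceback.
    cmd = docker_run_command("rnnpose:latest", "/home/RNNPose/tools/eval.py")
    print(" ".join(shlex.quote(part) for part in cmd))
```

torch.cuda.set_device(rank) does not appear here because it is called inside the script, before init_process_group.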
Any help is appreciated. Let me know if more details are needed. Thanks in advance.