RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:492, internal error, NCCL version 2.4.8

The NCCL error happens when I try to run a job on 3 nodes; everything works fine when running on a single node.
My launch command is:

/usr/local/bin/mpirun --hostfile /var/storage/shared/resrchvc/sys/jobs/application_1602032654055_58426/scratch/1/mpi-hosts --tag-output -x NCCL_IB_DISABLE=1 -np 4 -map-by node -bind-to none -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x PT_OUTPUT_DIR -x PT_DATA_DIR -x PT_LOGS_DIR -x PT_CODE_DIR -x PYTHONBREAKPOINT -x NCCL_IB_DISABLE=0 -x NCCL_IB_HCA=mlx5_0,mlx5_2 -x NCCL_SOCKET_IFNAME=ib0 mmf_run config=projects/visual_bert/configs/localized_narratives/pretrain.yaml model=visual_bert dataset=masked_localized_narratives run_type=train env.cache_dir=/mnt/default/mmf_cache env.save_dir=/mnt/output/projects/mmf/ln_mask_pretrain_experiment/pt-results/application_1602032654055_58426 env.log_dir=/mnt/output/projects/mmf/ln_mask_pretrain_experiment/logs/application_1602032654055_58426 env.data_dir=/mnt/default/mmf_cache/data

The detailed debug output is shown below. Any ideas on how to tackle this problem?

........
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:266 [5] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:264 [3] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:266 [5] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_2
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:264 [3] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_2
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:266 [5] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:264 [3] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:266 [5] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_0
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:264 [3] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_0
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:266 [5] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:264 [3] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:266 [5] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_3
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:264 [3] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_3
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:266 [5] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:264 [3] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:266 [5] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_1
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:264 [3] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_1
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:266 [5] NCCL INFO NET/IB : No device found.
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:264 [3] NCCL INFO NET/IB : No device found.
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:265:265 [4] NCCL INFO NET/Socket : Using [0]ib0:192.168.33.19<0>
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:264 [3] NCCL INFO NET/Socket : Using [0]ib0:192.168.33.19<0>
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:266 [5] NCCL INFO NET/Socket : Using [0]ib0:192.168.33.19<0>
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] NCCL INFO Bootstrap : Using [0]ib0:192.168.33.19<0>
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_2
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_0
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_3
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_1
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] NCCL INFO NET/IB : No device found.
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] NCCL INFO NET/Socket : Using [0]ib0:192.168.33.19<0>
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:261:551 [0] NCCL INFO Setting affinity for GPU 0 to 1fc07f
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:558 [2] NCCL INFO Setting affinity for GPU 2 to 1fc07f
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:262:553 [1] NCCL INFO Setting affinity for GPU 1 to 1fc07f
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:557 [5] NCCL INFO Setting affinity for GPU 5 to 0fe03f80
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:267:552 [6] NCCL INFO Setting affinity for GPU 6 to 0fe03f80
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:265:555 [4] NCCL INFO Setting affinity for GPU 4 to 0fe03f80
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:556 [3] NCCL INFO Setting affinity for GPU 3 to 1fc07f
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:269:554 [7] NCCL INFO Setting affinity for GPU 7 to 0fe03f80
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:261:551 [0] NCCL INFO Channel 00 :    0   1   2   3   4   5   6   7
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:265:555 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/IPC
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:261:551 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:269:554 [7] NCCL INFO Ring 00 : 7[7] -> 0[0] via direct shared memory
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:556 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via direct shared memory
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:557 [5] NCCL INFO Ring 00 : 5[5] -> 6[6] via P2P/IPC
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:267:552 [6] NCCL INFO Ring 00 : 6[6] -> 7[7] via P2P/IPC
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:262:553 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:558 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:265:555 [4] NCCL INFO comm 0x7f5da4002600 rank 4 nranks 8 cudaDev 4 nvmlDev 4 - Init COMPLETE
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:261:551 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:557 [5] NCCL INFO comm 0x7fdb60002600 rank 5 nranks 8 cudaDev 5 nvmlDev 5 - Init COMPLETE
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:261:551 [0] NCCL INFO comm 0x7f1f34002600 rank 0 nranks 8 cudaDev 0 nvmlDev 0 - Init COMPLETE
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:261:261 [0] NCCL INFO Launch mode Parallel
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:269:554 [7] NCCL INFO comm 0x7f656c002600 rank 7 nranks 8 cudaDev 7 nvmlDev 7 - Init COMPLETE
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:556 [3] NCCL INFO comm 0x7f7dcc002600 rank 3 nranks 8 cudaDev 3 nvmlDev 3 - Init COMPLETE
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:267:552 [6] NCCL INFO comm 0x7f2b18002600 rank 6 nranks 8 cudaDev 6 nvmlDev 6 - Init COMPLETE
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:262:553 [1] NCCL INFO comm 0x7fa314002600 rank 1 nranks 8 cudaDev 1 nvmlDev 1 - Init COMPLETE
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:558 [2] NCCL INFO comm 0x7fe718002600 rank 2 nranks 8 cudaDev 2 nvmlDev 2 - Init COMPLETE
[1,3]<stdout>:2020-10-22T19:32:39 | mmf: Logging to: /mnt/output/projects/mmf/ln_mask_pretrain_experiment/pt-results/application_1602032654055_58426/train.log
[1,3]<stdout>:2020-10-22T19:32:39 | mmf_cli.run: Namespace(config_override=None, local_rank=None, opts=['config=projects/visual_bert/configs/localized_narratives/pretrain.yaml', 'model=visual_bert', 'dataset=masked_localized_narratives', 'run_type=train', 'env.cache_dir=/mnt/default/mmf_cache', 'env.save_dir=/mnt/output/projects/mmf/ln_mask_pretrain_experiment/pt-results/application_1602032654055_58426', 'env.log_dir=/mnt/output/projects/mmf/ln_mask_pretrain_experiment/logs/application_1602032654055_58426', 'env.data_dir=/mnt/default/mmf_cache/data'])
[1,3]<stdout>:2020-10-22T19:32:39 | mmf_cli.run: Torch version: 1.6.0+cu101
[1,3]<stdout>:2020-10-22T19:32:39 | mmf.utils.general: CUDA Device 0 is: Tesla V100-PCIE-32GB
[1,3]<stdout>:2020-10-22T19:32:39 | mmf_cli.run: Using seed 39811694
[1,3]<stdout>:2020-10-22T19:32:39 | mmf.trainers.mmf_trainer: Loading datasets
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:  File "/home/v-kunyan/.local/bin/mmf_run", line 8, in <module>
[1,1]<stderr>:    sys.exit(run())
[1,1]<stderr>:  File "/home/v-kunyan/.local/lib/python3.7/site-packages/mmf_cli/run.py", line 118, in run
[1,1]<stderr>:    nprocs=config.distributed.world_size,
[1,1]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 200, in spawn
[1,1]<stderr>:    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
[1,1]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
[1,1]<stderr>:    while not context.join():
[1,1]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 119, in join
[1,1]<stderr>:    raise Exception(msg)
[1,1]<stderr>:Exception: 
[1,1]<stderr>:
[1,1]<stderr>:-- Process 0 terminated with the following error:
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
[1,1]<stderr>:    fn(i, *args)
[1,1]<stderr>:  File "/home/v-kunyan/.local/lib/python3.7/site-packages/mmf_cli/run.py", line 66, in distributed_main
[1,1]<stderr>:    main(configuration, init_distributed=True, predict=predict)
[1,1]<stderr>:  File "/home/v-kunyan/.local/lib/python3.7/site-packages/mmf_cli/run.py", line 33, in main
[1,1]<stderr>:    distributed_init(config)
[1,1]<stderr>:  File "/home/v-kunyan/.local/lib/python3.7/site-packages/mmf/utils/distributed.py", line 244, in distributed_init
[1,1]<stderr>:    dist.all_reduce(torch.zeros(1).cuda())
[1,1]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 936, in all_reduce
[1,1]<stderr>:    work = _default_pg.allreduce([tensor], opts)
[1,1]<stderr>:RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:492, internal error, NCCL version 2.4.8
[1,1]<stderr>:
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[39364,1],1]
  Exit code:    1

Typically this indicates an error in the NCCL library itself (not at the PyTorch layer), so unfortunately we don’t have much visibility into the cause of the error. Is the error consistent, or does the training work if you re-run? Are you using 3 nodes with 8 GPUs each?

Yes, I’m using 3 nodes with 8 GPUs each and this error can be reproduced every time.

This sounds like a setup or NCCL issue, so you could install the NCCL tests and check whether the MPI workload is working properly in your setup.
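For example, a rough sketch using NVIDIA's nccl-tests could look like this (the MPI/CUDA paths, the process count, and the hostfile are placeholders for your environment, not taken from your job):

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
# MPI_HOME / CUDA_HOME are placeholders for your installation paths
make MPI=1 MPI_HOME=/usr/local/mpi CUDA_HOME=/usr/local/cuda

# one process per node, 8 GPUs per process, reusing the NCCL settings from the training job
/usr/local/bin/mpirun --hostfile <your-hostfile> -np 3 -map-by node -bind-to none \
    -x NCCL_DEBUG=INFO -x NCCL_IB_DISABLE=0 -x NCCL_IB_HCA=mlx5_0,mlx5_2 -x NCCL_SOCKET_IFNAME=ib0 \
    ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8

If all_reduce_perf also fails or hangs across nodes, the problem is in the cluster/NCCL setup rather than in MMF or PyTorch.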

The NCCL test output is as follows:


Does this mean that the NCCL setup is working correctly?
By the way, I’ve noticed that the NCCL version in my Docker image is 2.7.8, but the runtime error reports NCCL version 2.4.8. It seems that PyTorch has another version installed internally; will the version mismatch lead to an error?
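For reference, the NCCL version that PyTorch itself was built against can be checked inside the container like this (the printed value below is only an example):

python -c "import torch; print(torch.__version__, torch.cuda.nccl.version())"
# e.g. prints "1.6.0+cu101 2408", i.e. the PyTorch binary carries its own NCCL 2.4.8 independent of the system libnccl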
Thank you all for your time! :smile:

The NCCL submodule was updated to 2.7.8 approximately a month ago, so you could use the nightly binaries to get the same version (which seems to work in your setup), or test 2.4.8 directly in the container.
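If you go the nightly route, something along these lines should work (the index URL is from memory; please double-check the exact command for your CUDA version on pytorch.org/get-started):

pip install --pre torch -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html
# afterwards, confirm the NCCL version the new binary reports
python -c "import torch; print(torch.__version__, torch.cuda.nccl.version())"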