An NCCL error occurs when I try to run a job across 3 nodes. Everything works fine when running on a single node.
My launch command is:
/usr/local/bin/mpirun --hostfile /var/storage/shared/resrchvc/sys/jobs/application_1602032654055_58426/scratch/1/mpi-hosts \
  --tag-output -x NCCL_IB_DISABLE=1 -np 4 -map-by node -bind-to none \
  -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
  -x PT_OUTPUT_DIR -x PT_DATA_DIR -x PT_LOGS_DIR -x PT_CODE_DIR -x PYTHONBREAKPOINT \
  -x NCCL_IB_DISABLE=0 -x NCCL_IB_HCA=mlx5_0,mlx5_2 -x NCCL_SOCKET_IFNAME=ib0 \
  mmf_run config=projects/visual_bert/configs/localized_narratives/pretrain.yaml \
    model=visual_bert dataset=masked_localized_narratives run_type=train \
    env.cache_dir=/mnt/default/mmf_cache \
    env.save_dir=/mnt/output/projects/mmf/ln_mask_pretrain_experiment/pt-results/application_1602032654055_58426 \
    env.log_dir=/mnt/output/projects/mmf/ln_mask_pretrain_experiment/logs/application_1602032654055_58426 \
    env.data_dir=/mnt/default/mmf_cache/data
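Since NCCL_IB_DISABLE is exported twice with different values in the command above, it is worth confirming what the ranks actually inherit. A minimal local sketch (the export values are copied from the -x flags above; under mpirun the flags would propagate them to each rank instead):

```shell
# Reproduce locally the NCCL environment the -x flags export, then print it.
# The debug log confirms the later NCCL_IB_DISABLE=0 is the effective value.
export NCCL_DEBUG=INFO NCCL_IB_DISABLE=0 \
       NCCL_IB_HCA=mlx5_0,mlx5_2 NCCL_SOCKET_IFNAME=ib0
env | grep '^NCCL_' | sort
```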
The detailed debug output is shown below. Note that NCCL_IB_DISABLE appears twice in the command (first as 1, then as 0); the log line "NCCL_IB_DISABLE set by environment to 0" confirms the effective value is 0, so NCCL attempts to use InfiniBand. Any ideas on how to tackle this problem?
........
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:266 [5] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:264 [3] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:266 [5] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_2
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:264 [3] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_2
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:266 [5] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:264 [3] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:266 [5] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_0
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:264 [3] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_0
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:266 [5] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:264 [3] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:266 [5] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_3
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:264 [3] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_3
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:266 [5] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:264 [3] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:266 [5] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_1
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:264 [3] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_1
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:266 [5] NCCL INFO NET/IB : No device found.
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:264 [3] NCCL INFO NET/IB : No device found.
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:265:265 [4] NCCL INFO NET/Socket : Using [0]ib0:192.168.33.19<0>
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:264 [3] NCCL INFO NET/Socket : Using [0]ib0:192.168.33.19<0>
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:266 [5] NCCL INFO NET/Socket : Using [0]ib0:192.168.33.19<0>
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] NCCL INFO Bootstrap : Using [0]ib0:192.168.33.19<0>
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_2
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_0
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_3
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
[1,3]<stdout>:
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] transport/net_ib.cc:117 NCCL WARN NET/IB : Unable to open device mlx5_1
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] NCCL INFO NET/IB : No device found.
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:263 [2] NCCL INFO NET/Socket : Using [0]ib0:192.168.33.19<0>
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:261:551 [0] NCCL INFO Setting affinity for GPU 0 to 1fc07f
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:558 [2] NCCL INFO Setting affinity for GPU 2 to 1fc07f
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:262:553 [1] NCCL INFO Setting affinity for GPU 1 to 1fc07f
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:557 [5] NCCL INFO Setting affinity for GPU 5 to 0fe03f80
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:267:552 [6] NCCL INFO Setting affinity for GPU 6 to 0fe03f80
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:265:555 [4] NCCL INFO Setting affinity for GPU 4 to 0fe03f80
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:556 [3] NCCL INFO Setting affinity for GPU 3 to 1fc07f
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:269:554 [7] NCCL INFO Setting affinity for GPU 7 to 0fe03f80
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:261:551 [0] NCCL INFO Channel 00 : 0 1 2 3 4 5 6 7
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:265:555 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/IPC
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:261:551 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:269:554 [7] NCCL INFO Ring 00 : 7[7] -> 0[0] via direct shared memory
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:556 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via direct shared memory
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:557 [5] NCCL INFO Ring 00 : 5[5] -> 6[6] via P2P/IPC
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:267:552 [6] NCCL INFO Ring 00 : 6[6] -> 7[7] via P2P/IPC
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:262:553 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:558 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:265:555 [4] NCCL INFO comm 0x7f5da4002600 rank 4 nranks 8 cudaDev 4 nvmlDev 4 - Init COMPLETE
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:261:551 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:266:557 [5] NCCL INFO comm 0x7fdb60002600 rank 5 nranks 8 cudaDev 5 nvmlDev 5 - Init COMPLETE
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:261:551 [0] NCCL INFO comm 0x7f1f34002600 rank 0 nranks 8 cudaDev 0 nvmlDev 0 - Init COMPLETE
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:261:261 [0] NCCL INFO Launch mode Parallel
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:269:554 [7] NCCL INFO comm 0x7f656c002600 rank 7 nranks 8 cudaDev 7 nvmlDev 7 - Init COMPLETE
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:264:556 [3] NCCL INFO comm 0x7f7dcc002600 rank 3 nranks 8 cudaDev 3 nvmlDev 3 - Init COMPLETE
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:267:552 [6] NCCL INFO comm 0x7f2b18002600 rank 6 nranks 8 cudaDev 6 nvmlDev 6 - Init COMPLETE
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:262:553 [1] NCCL INFO comm 0x7fa314002600 rank 1 nranks 8 cudaDev 1 nvmlDev 1 - Init COMPLETE
[1,3]<stdout>:container-e2250-1602032654055-58426-01-000008:263:558 [2] NCCL INFO comm 0x7fe718002600 rank 2 nranks 8 cudaDev 2 nvmlDev 2 - Init COMPLETE
[1,3]<stdout>:2020-10-22T19:32:39 | mmf: Logging to: /mnt/output/projects/mmf/ln_mask_pretrain_experiment/pt-results/application_1602032654055_58426/train.log
[1,3]<stdout>:2020-10-22T19:32:39 | mmf_cli.run: Namespace(config_override=None, local_rank=None, opts=['config=projects/visual_bert/configs/localized_narratives/pretrain.yaml', 'model=visual_bert', 'dataset=masked_localized_narratives', 'run_type=train', 'env.cache_dir=/mnt/default/mmf_cache', 'env.save_dir=/mnt/output/projects/mmf/ln_mask_pretrain_experiment/pt-results/application_1602032654055_58426', 'env.log_dir=/mnt/output/projects/mmf/ln_mask_pretrain_experiment/logs/application_1602032654055_58426', 'env.data_dir=/mnt/default/mmf_cache/data'])
[1,3]<stdout>:2020-10-22T19:32:39 | mmf_cli.run: Torch version: 1.6.0+cu101
[1,3]<stdout>:2020-10-22T19:32:39 | mmf.utils.general: CUDA Device 0 is: Tesla V100-PCIE-32GB
[1,3]<stdout>:2020-10-22T19:32:39 | mmf_cli.run: Using seed 39811694
[1,3]<stdout>:2020-10-22T19:32:39 | mmf.trainers.mmf_trainer: Loading datasets
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>: File "/home/v-kunyan/.local/bin/mmf_run", line 8, in <module>
[1,1]<stderr>: sys.exit(run())
[1,1]<stderr>: File "/home/v-kunyan/.local/lib/python3.7/site-packages/mmf_cli/run.py", line 118, in run
[1,1]<stderr>: nprocs=config.distributed.world_size,
[1,1]<stderr>: File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 200, in spawn
[1,1]<stderr>: return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
[1,1]<stderr>: File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
[1,1]<stderr>: while not context.join():
[1,1]<stderr>: File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 119, in join
[1,1]<stderr>: raise Exception(msg)
[1,1]<stderr>:Exception:
[1,1]<stderr>:
[1,1]<stderr>:-- Process 0 terminated with the following error:
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>: File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
[1,1]<stderr>: fn(i, *args)
[1,1]<stderr>: File "/home/v-kunyan/.local/lib/python3.7/site-packages/mmf_cli/run.py", line 66, in distributed_main
[1,1]<stderr>: main(configuration, init_distributed=True, predict=predict)
[1,1]<stderr>: File "/home/v-kunyan/.local/lib/python3.7/site-packages/mmf_cli/run.py", line 33, in main
[1,1]<stderr>: distributed_init(config)
[1,1]<stderr>: File "/home/v-kunyan/.local/lib/python3.7/site-packages/mmf/utils/distributed.py", line 244, in distributed_init
[1,1]<stderr>: dist.all_reduce(torch.zeros(1).cuda())
[1,1]<stderr>: File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 936, in all_reduce
[1,1]<stderr>: work = _default_pg.allreduce([tensor], opts)
[1,1]<stderr>:RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:492, internal error, NCCL version 2.4.8
[1,1]<stderr>:
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[39364,1],1]
Exit code: 1
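For what it's worth, the repeated "Call to ibv_open_device failed" warnings suggest the container on at least one node cannot open any of the mlx5 devices (e.g. missing /dev/infiniband device nodes or permissions), after which NCCL falls back to sockets on ib0, and the internal error then surfaces during the first cross-node all_reduce. A quick check I can run inside each container (a sketch assuming nothing beyond a POSIX shell; the paths are the standard RDMA device locations, not something from this log):

```shell
# Check whether the RDMA device nodes that ibv_open_device needs are
# visible inside this container; if either path is missing or empty,
# NCCL's NET/IB transport cannot open mlx5_0..mlx5_3.
for d in /dev/infiniband /sys/class/infiniband; do
  if [ -e "$d" ]; then
    echo "$d: present"
    ls "$d"
  else
    echo "$d: missing"
  fi
done
```

If the devices turn out to be missing, the next things I would try are launching the container with the host's /dev/infiniband exposed, or setting NCCL_IB_DISABLE=1 consistently (removing the later =0) to force the socket transport on all nodes.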