Thank you @pietern and @zhangguanheng66 for your suggestions. I have modified the IP address and the NCCL environment variables, but I am now getting a different error. I have referred to the following issues to try to resolve it, but they did not help me much:
- https://github.com/pytorch/fairseq/issues/138
- Nccl error in torch._C._dist_broadcast(tensor, src, group) when train in two nodes
- Multi node distributed training: RuntimeError: NCCL error in /torch/lib/THD/base/data_channels/DataChannelNccl.cpp:322, unhandled system error
I have a simple multi-node GPU setup: 2 nodes in total with 1 GPU on each node, so there are 2 GPUs overall.
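For context, training is launched on each node roughly as follows (the data directory, IP address and port below are placeholders, not my real values):

On node 0 (master): python train.py <data-dir> --distributed-world-size 2 --distributed-rank 0 --distributed-init-method tcp://<master-ip>:<port> ...
On node 1 (worker): python train.py <data-dir> --distributed-world-size 2 --distributed-rank 1 --distributed-init-method tcp://<master-ip>:<port> ...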
Log on the worker node:
Traceback (most recent call last):
File "software//fairseq-py/train.py", line 347, in <module>
distributed_main(args)
File "software/fairseq-py/distributed_train.py", line 39, in main
single_process_main(args)
File "software/fairseq-py/train.py", line 87, in main
train(args, trainer, task, epoch_itr)
File "software/fairseq-py/train.py", line 125, in train
log_output = trainer.train_step(sample, update_params=True)
File "software/fairseq-py/fairseq/trainer.py", line 137, in train_step
(sample_sizes, logging_outputs, ooms_fwd, ooms_bwd)
File "software/fairseq-py/fairseq/distributed_utils.py", line 77, in all_gather_list
torch.distributed.all_gather(out_buffers, in_buffer.cuda())
File "venv/lib/python3.6/site-packages/torch/distributed/__init__.py", line 439, in all_gather
return all_gather_multigpu([tensor_list], [tensor], group)
File "venv/lib/python3.6/site-packages/torch/distributed/__init__.py", line 413, in all_gather_multigpu
group)
RuntimeError: NCCL error in: /pytorch/torch/lib/THD/base/data_channels/DataChannelNccl.cpp:322, unhandled system error
CUDA 10.1
cuDNN 7.6.4
NCCL 2.4.6
PyTorch 1.1.0
NCCL environment variables:
export NCCL_SOCKET_IFNAME=ens3
export NCCL_DEBUG=INFO
export NCCL_IB_CUDA_SUPPORT=0
export NCCL_P2P_DISABLE=0
export NCCL_IB_DISABLE=1
export NCCL_NET_GDR_LEVEL=3
export NCCL_NET_GDR_READ=0
export NCCL_SHM_DISABLE=0
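To check whether the problem is in fairseq or in the underlying PyTorch/NCCL setup, I am also planning to run a minimal two-process all_reduce test outside of fairseq. A rough sketch (the master IP and port are placeholders, and the RANK environment variable is set to 0 on the master node and 1 on the worker):

import os
import torch
import torch.distributed as dist

# RANK is 0 on the master node and 1 on the worker; master IP/port are placeholders.
rank = int(os.environ["RANK"])
dist.init_process_group(
    backend="nccl",
    init_method="tcp://<master-ip>:23456",
    world_size=2,
    rank=rank,
)

# Each node has a single GPU, so device 0 on both nodes.
tensor = torch.ones(1).cuda(0) * (rank + 1)
dist.all_reduce(tensor)  # expected value on both nodes: 3.0
print("rank", rank, "all_reduce result:", tensor.item())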
I have run nccl-tests with the following command and it runs perfectly: ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1
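That said, this only exercises a single GPU on one node (-g 1). If I understand correctly, testing the actual cross-node NCCL path would need an MPI launch along these lines (the hostnames are placeholders, and nccl-tests must be built with MPI=1 for this):

mpirun -np 2 -H node1,node2 ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1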
As far as I can tell, the CUDA, cuDNN and NCCL versions are compatible with each other. Is there anything I'm missing? Any help or suggestion is appreciated.
Thanks,