Encountering an error while running distributed training on fairseq

Thank you @pietern and @zhangguanheng66 for your suggestions. I have modified the IP address and the NCCL environment variables, but I am now getting a different error. I referred to the following issues while trying to resolve this, but they didn’t help much.

I have a simple multi-node GPU setup: 2 nodes in total, with 1 GPU on each node, so 2 GPUs overall.

Log on the worker node:

Traceback (most recent call last):
  File "software//fairseq-py/train.py", line 347, in <module>
    distributed_main(args)
  File "software/fairseq-py/distributed_train.py", line 39, in main
    single_process_main(args)
  File "software/fairseq-py/train.py", line 87, in main
    train(args, trainer, task, epoch_itr)
  File "software/fairseq-py/train.py", line 125, in train
    log_output = trainer.train_step(sample, update_params=True)
  File "software/fairseq-py/fairseq/trainer.py", line 137, in train_step
    (sample_sizes, logging_outputs, ooms_fwd, ooms_bwd)
  File "software/fairseq-py/fairseq/distributed_utils.py", line 77, in all_gather_list
    torch.distributed.all_gather(out_buffers, in_buffer.cuda())
  File "venv/lib/python3.6/site-packages/torch/distributed/__init__.py", line 439, in all_gather
    return all_gather_multigpu([tensor_list], [tensor], group)
  File "venv/lib/python3.6/site-packages/torch/distributed/__init__.py", line 413, in all_gather_multigpu
    group)
RuntimeError: NCCL error in: /pytorch/torch/lib/THD/base/data_channels/DataChannelNccl.cpp:322, unhandled system error
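To isolate this from fairseq, a minimal two-process script along the following lines should exercise the same all_gather call directly through torch.distributed (the master address, port, and --rank argument below are placeholders for my setup, not the actual fairseq launch options):

import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--rank", type=int, required=True)  # 0 on the master node, 1 on the worker
args = parser.parse_args()

# NCCL backend with a TCP rendezvous; adjust to match the fairseq launch settings
dist.init_process_group(
    backend="nccl",
    init_method="tcp://<master-node-ip>:23456",  # placeholder address and port
    world_size=2,
    rank=args.rank,
)

# One GPU per node, so cuda:0 on both
tensor = torch.full((4,), float(args.rank), device="cuda:0")
gathered = [torch.zeros(4, device="cuda:0") for _ in range(2)]
dist.all_gather(gathered, tensor)  # the same collective that fails inside fairseq
print("rank {}: {}".format(args.rank, gathered))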

CUDA 10.1
cuDNN 7.6.4
NCCL 2.4.6
PyTorch 1.1.0
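
To rule out a mismatch between the system libraries and the ones PyTorch was actually built with (as far as I understand, the pip wheels ship their own NCCL, which can differ from the system 2.4.6), here is a quick check, assuming torch.cuda.nccl.version() is available in this build:

import torch

print(torch.__version__)               # 1.1.0
print(torch.version.cuda)              # CUDA version the wheel was built against
print(torch.backends.cudnn.version())  # cuDNN version seen by PyTorch
print(torch.cuda.nccl.version())       # NCCL version bundled with PyTorch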

NCCL environment variables:

export NCCL_SOCKET_IFNAME=ens3
export NCCL_DEBUG=INFO
export NCCL_IB_CUDA_SUPPORT=0
export NCCL_P2P_DISABLE=0
export NCCL_IB_DISABLE=1
export NCCL_NET_GDR_LEVEL=3
export NCCL_NET_GDR_READ=0
export NCCL_SHM_DISABLE=0
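
Since these are set with export in the shell, a quick sanity check that they are actually visible inside the training process on both nodes (for example, when the worker is launched over ssh) could look like this:

import os

# Print the NCCL-related settings as seen by the Python process itself
for name in ("NCCL_SOCKET_IFNAME", "NCCL_DEBUG", "NCCL_IB_DISABLE", "NCCL_SHM_DISABLE"):
    print(name, "=", os.environ.get(name))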

I have run nccl-tests with the following command and it runs perfectly: ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1

As far as I can tell, the CUDA, cuDNN, and NCCL versions are compatible with each other. Is there anything I’m missing? Any help or suggestion would be appreciated.

Thanks,