Multi-node distributed training: RuntimeError: NCCL error in /torch/lib/THD/base/data_channels/DataChannelNccl.cpp:322, unhandled system error

Hi, I am new to PyTorch, and I am trying to deploy a distributed training task across 2 nodes with 4 GPUs each. I have followed the comments in the torch.distributed.launch source code, but I am still confused.

Node 1 script

CUDA_VISIBLE_DEVICES=3,2,1,0 python2 -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="11.7.157.133" \
    --master_port=12345 \
    main.py --folder ./experiments/pairwise_shangyi_fpnembed

Node 2 script

CUDA_VISIBLE_DEVICES=3,2,1,0 python2 -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr="11.7.157.133" \
    --master_port=12345 \
    main.py --folder ./experiments/pairwise_shangyi_fpnembed

And I always hit the following error on Node 2:

Traceback (most recent call last):
  File "main.py", line 33, in <module>
    trainer.train()
  File "/export/home/v-jianjie/net/paizhaogou/metric_learning/trainer.py", line 165, in train
    self.setup_network()
  File "/export/home/v-jianjie/net/paizhaogou/metric_learning/trainer.py", line 90, in setup_network
    broadcast_buffers=False,)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/distributed.py", line 134, in __init__
    self.broadcast_bucket_size)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/distributed.py", line 251, in _dist_broadcast_coalesced
    dist.broadcast(flat_tensors, 0)
  File "/usr/local/lib/python2.7/dist-packages/torch/distributed/__init__.py", line 286, in broadcast
    return torch._C._dist_broadcast(tensor, src, group)
RuntimeError: NCCL error in: /export/home/v-yehl/code/caffe2/pytorch/torch/lib/THD/base/data_channels/DataChannelNccl.cpp:322, unhandled system error

The main.py script runs correctly on a single node.
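
For reference, main.py follows the usual env:// setup that torch.distributed.launch expects. Here is a simplified sketch of it (build_model is a placeholder for the real network construction, and the trainer details are omitted):

import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to every process it spawns
parser.add_argument('--local_rank', type=int, default=0)
parser.add_argument('--folder', type=str)
args = parser.parse_args()

# Pin this process to its GPU before creating the process group
torch.cuda.set_device(args.local_rank)

# MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are exported by the
# launcher, so the env:// init method reads everything it needs
dist.init_process_group(backend='nccl', init_method='env://')

model = build_model().cuda()  # build_model is a placeholder
model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[args.local_rank],
    output_device=args.local_rank,
    broadcast_buffers=False,
)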

Thanks in advance.

I found the solution.

If you are running inside nvidia-docker, you need to add the --network=host flag to the docker run command so that the container uses the same IP address as the host.
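
For example, on each node (my_pytorch_image and launch_node.sh are placeholders for your own image and training script):

nvidia-docker run --network=host -it my_pytorch_image \
    bash launch_node.sh

With --network=host the container shares the host's network stack, so the master_addr above is reachable from inside the container exactly as it is from the host.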

The NCCL error you posted doesn't convey any information that can help, unfortunately. Take a look at https://pytorch.org/docs/stable/distributed.html#other-nccl-environment-variables for some environment variables you can set that may help you debug this issue.
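
For example, running the launch command with NCCL's debug variables set will make NCCL log what it is doing during initialization (eth0 is a placeholder; point NCCL_SOCKET_IFNAME at the interface that actually carries 11.7.157.133):

NCCL_DEBUG=INFO NCCL_SOCKET_IFNAME=eth0 \
CUDA_VISIBLE_DEVICES=3,2,1,0 python2 -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr="11.7.157.133" \
    --master_port=12345 \
    main.py --folder ./experiments/pairwise_shangyi_fpnembed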