Hi, I am new to PyTorch, and I want to deploy a distributed training job across 2 nodes, each with 4 GPUs. I have followed the comments in the torch.distributed.launch code, but I am still confused.
Node 1 script
CUDA_VISIBLE_DEVICES=3,2,1,0 python2 -m torch.distributed.launch \
--nproc_per_node=4 \
--nnodes=2 \
--node_rank=0 \
--master_addr="11.7.157.133" \
--master_port=12345 \
main.py --folder ./experiments/pairwise_shangyi_fpnembed
Node 2 script
CUDA_VISIBLE_DEVICES=3,2,1,0 python2 -m torch.distributed.launch \
--nproc_per_node=4 \
--nnodes=2 \
--node_rank=1 \
--master_addr="11.7.157.133" \
--master_port=12345 \
main.py --folder ./experiments/pairwise_shangyi_fpnembed
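For context, the distributed setup that main.py / trainer.py runs is roughly the following (a simplified sketch, not the exact code; build_model is a placeholder for my real network):

import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to every process it spawns
parser.add_argument('--local_rank', type=int, default=0)
parser.add_argument('--folder', type=str)
args = parser.parse_args()

# init_method='env://' reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE,
# which torch.distributed.launch exports for each spawned process
dist.init_process_group(backend='nccl', init_method='env://')

torch.cuda.set_device(args.local_rank)
model = build_model().cuda()  # build_model is a placeholder for my actual network
model = DistributedDataParallel(model,
                                device_ids=[args.local_rank],
                                output_device=args.local_rank,
                                broadcast_buffers=False)

As the traceback below shows, the failure happens while constructing DistributedDataParallel, i.e. at the first NCCL broadcast between the two nodes.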
And I always get the following error on Node 2:
Traceback (most recent call last):
  File "main.py", line 33, in <module>
    trainer.train()
  File "/export/home/v-jianjie/net/paizhaogou/metric_learning/trainer.py", line 165, in train
    self.setup_network()
  File "/export/home/v-jianjie/net/paizhaogou/metric_learning/trainer.py", line 90, in setup_network
    broadcast_buffers=False,)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/distributed.py", line 134, in __init__
    self.broadcast_bucket_size)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/distributed.py", line 251, in _dist_broadcast_coalesced
    dist.broadcast(flat_tensors, 0)
  File "/usr/local/lib/python2.7/dist-packages/torch/distributed/__init__.py", line 286, in broadcast
    return torch._C._dist_broadcast(tensor, src, group)
RuntimeError: NCCL error in: /export/home/v-yehl/code/caffe2/pytorch/torch/lib/THD/base/data_channels/DataChannelNccl.cpp:322, unhandled system error
The main.py script runs correctly on a single node.
Thanks in advance.