Hi, I am new to PyTorch, and I want to deploy a distributed training job across 2 nodes, each with 4 GPUs. I have followed the comments in the torch.distributed.launch code, but I am still confused.
Node 1 script
CUDA_VISIBLE_DEVICES=3,2,1,0 python2 -m torch.distributed.launch \
--nproc_per_node=4 \
--nnodes=2 \
--node_rank=0 \
--master_addr="11.7.157.133" \
--master_port=12345 \
main.py --folder ./experiments/pairwise_shangyi_fpnembed
Node 2 script
CUDA_VISIBLE_DEVICES=3,2,1,0 python2 -m torch.distributed.launch \
--nproc_per_node=4 \
--nnodes=2 \
--node_rank=1 \
--master_addr="11.7.157.133" \
--master_port=12345 \
main.py --folder ./experiments/pairwise_shangyi_fpnembed
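For context, the distributed setup that main.py / trainer.py runs is roughly the following (a simplified sketch, not the exact code; build_model is a placeholder for my real network):

import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to every process it spawns
parser.add_argument('--local_rank', type=int, default=0)
parser.add_argument('--folder', type=str)
args = parser.parse_args()

# init_method='env://' reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE,
# which torch.distributed.launch exports for each spawned process
dist.init_process_group(backend='nccl', init_method='env://')

torch.cuda.set_device(args.local_rank)
model = build_model().cuda()  # build_model is a placeholder for my actual network
model = DistributedDataParallel(model,
                                device_ids=[args.local_rank],
                                output_device=args.local_rank,
                                broadcast_buffers=False)

As the traceback below shows, the failure happens while constructing DistributedDataParallel, i.e. at the first NCCL broadcast between the two nodes.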
And I always get the following error on Node 2:
Traceback (most recent call last):
  File "main.py", line 33, in <module>
    trainer.train()
  File "/export/home/v-jianjie/net/paizhaogou/metric_learning/trainer.py", line 165, in train
    self.setup_network()
  File "/export/home/v-jianjie/net/paizhaogou/metric_learning/trainer.py", line 90, in setup_network
    broadcast_buffers=False,)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/distributed.py", line 134, in __init__
    self.broadcast_bucket_size)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/distributed.py", line 251, in _dist_broadcast_coalesced
    dist.broadcast(flat_tensors, 0)
  File "/usr/local/lib/python2.7/dist-packages/torch/distributed/__init__.py", line 286, in broadcast
    return torch._C._dist_broadcast(tensor, src, group)
RuntimeError: NCCL error in: /export/home/v-yehl/code/caffe2/pytorch/torch/lib/THD/base/data_channels/DataChannelNccl.cpp:322, unhandled system error
The main.py script runs correctly on a single node.
Thanks in advance.