Hi PyTorch Community Members,
I am trying to run distributed training on 2 nodes with 8 GPUs each (K80s), 16 GPUs in total. I'm using NCCL as the backend, and the commands below to launch the distributed training.
I have set two NCCL environment variables:
$ export NCCL_SOCKET_IFNAME=ens3
$ export NCCL_DEBUG=INFO
On the 1st node I'm executing the fairseq training command with the following distributed training flags:
PYTHONPATH=$FAIRSEQPY:$PYTHONPATH CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3.6 $FAIRSEQPY/train.py <ALL other training specific flags>... --distributed-world-size 16 --distributed-rank 0 --distributed-backend "nccl" --distributed-init-method 'tcp://54.146.137.72:9001' --distributed-port 9001
On the 2nd node I'm executing the fairseq training command with the following distributed training flags:
PYTHONPATH=$FAIRSEQPY:$PYTHONPATH CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3.6 $FAIRSEQPY/train.py <All other training specific flags> --distributed-world-size 16 --distributed-rank 8 --distributed-backend "nccl" --distributed-init-method 'tcp://54.146.137.72:9001' --distributed-port 9001
On the second node I get the following error log:
Traceback (most recent call last):
  File "/home/<user>/mlconvgec2018_2019_06_25_1/mlconvgec2018/software//fairseq-py/train.py", line 347, in <module>
    distributed_main(args)
  File "/home/<user>/mlconvgec2018_2019_06_25_1/mlconvgec2018/software/fairseq-py/distributed_train.py", line 37, in main
    args.distributed_rank = distributed_utils.distributed_init(args)
  File "/home/<user>/mlconvgec2018_2019_06_25_1/mlconvgec2018/software/fairseq-py/fairseq/distributed_utils.py", line 28, in distributed_init
    world_size=args.distributed_world_size, rank=args.distributed_rank)
  File "/home/<user>/mlconvgec2018_2019_06_25_1/venv/lib/python3.6/site-packages/torch/distributed/__init__.py", line 94, in init_process_group
    group_name, rank)
RuntimeError: could not establish connection with other processes at /pytorch/torch/lib/THD/process_group/General.cpp:17
NCCL version: 2.4.8
PyTorch Version: 1.1.0
CUDA version: 9.2
I found the ens3 interface name by running the ifconfig command. I was referring to this documentation. I'm using the AWS cloud platform. Right now I'm not using a shared file system; I have a copy of the code and data on both nodes, and each node has 8 GPUs.
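From the traceback, the failure happens inside torch.distributed.init_process_group(). For reference, a standalone script along these lines (my own rough sketch, not the actual fairseq code; the RANK/LOCAL_RANK environment variables are just for illustration) should exercise the same init path and can be used to check whether NCCL can connect between the two nodes at all, independently of fairseq:

# Minimal connectivity test (my own sketch, not fairseq code).
# Launch one process per GPU: ranks 0-7 on node 1, ranks 8-15 on node 2,
# e.g. RANK=0 LOCAL_RANK=0 python3.6 test_dist.py
import os

import torch
import torch.distributed as dist

def main():
    rank = int(os.environ['RANK'])              # global rank, 0-15
    local_rank = int(os.environ['LOCAL_RANK'])  # GPU index on this node, 0-7
    torch.cuda.set_device(local_rank)

    # Same init arguments as in the fairseq commands above.
    dist.init_process_group(
        backend='nccl',
        init_method='tcp://54.146.137.72:9001',
        world_size=16,
        rank=rank,
    )

    # A single all_reduce verifies that NCCL can communicate across the nodes.
    t = torch.ones(1).cuda()
    dist.all_reduce(t)
    print('rank {}: all_reduce result = {}'.format(rank, t.item()))

if __name__ == '__main__':
    main()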
Is there something that I’m missing?
Any help is much appreciated.
Thanks,
Jalaj