Hi PyTorch Community Members,
I am trying to run distributed training on 2 nodes with 8 GPUs each (K80s), 16 GPUs in total. I'm using NCCL as the backend, and the commands below to launch the distributed training.
I have set two NCCL environment variables:
$ export NCCL_SOCKET_IFNAME=ens3
$ export NCCL_DEBUG=INFO
On the 1st node I'm executing the fairseq training command with the following distributed training flags:
PYTHONPATH=$FAIRSEQPY:$PYTHONPATH CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3.6 $FAIRSEQPY/train.py <ALL other training specific flags>... --distributed-world-size 16 --distributed-rank 0 --distributed-backend "nccl" --distributed-init-method 'tcp://54.146.137.72:9001' --distributed-port 9001
On the 2nd node I'm executing the fairseq training command with the following distributed training flags:
PYTHONPATH=$FAIRSEQPY:$PYTHONPATH CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3.6 $FAIRSEQPY/train.py <All other training specific flags> --distributed-world-size 16 --distributed-rank 8 --distributed-backend "nccl" --distributed-init-method 'tcp://54.146.137.72:9001' --distributed-port 9001
On the second node I get the following error log:
Traceback (most recent call last):
  File "/home/<user>/mlconvgec2018_2019_06_25_1/mlconvgec2018/software//fairseq-py/train.py", line 347, in <module>
    distributed_main(args)
  File "/home/<user>/mlconvgec2018_2019_06_25_1/mlconvgec2018/software/fairseq-py/distributed_train.py", line 37, in main
    args.distributed_rank = distributed_utils.distributed_init(args)
  File "/home/<user>/mlconvgec2018_2019_06_25_1/mlconvgec2018/software/fairseq-py/fairseq/distributed_utils.py", line 28, in distributed_init
    world_size=args.distributed_world_size, rank=args.distributed_rank)
  File "/home/<user>/mlconvgec2018_2019_06_25_1/venv/lib/python3.6/site-packages/torch/distributed/__init__.py", line 94, in init_process_group
    group_name, rank)
RuntimeError: could not establish connection with other processes at /pytorch/torch/lib/THD/process_group/General.cpp:17
NCCL version: 2.4.8
PyTorch Version: 1.1.0
CUDA version: 9.2
I found the ens3 interface name by running the ifconfig command. I was referring to this documentation. I'm using the AWS cloud platform. Right now I'm not using a shared file system; I have a copy of the code and data on both nodes, and each node has 8 GPUs.
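From the traceback, the failure happens inside torch.distributed.init_process_group(). For reference, a standalone script along these lines (my own rough sketch, not the actual fairseq code; the RANK/LOCAL_RANK environment variables are just for illustration) should exercise the same init path and can be used to check whether NCCL can connect between the two nodes at all, independently of fairseq:

# Minimal connectivity test (my own sketch, not fairseq code).
# Launch one process per GPU: ranks 0-7 on node 1, ranks 8-15 on node 2,
# e.g. RANK=0 LOCAL_RANK=0 python3.6 test_dist.py
import os

import torch
import torch.distributed as dist

def main():
    rank = int(os.environ['RANK'])              # global rank, 0-15
    local_rank = int(os.environ['LOCAL_RANK'])  # GPU index on this node, 0-7
    torch.cuda.set_device(local_rank)

    # Same init arguments as in the fairseq commands above.
    dist.init_process_group(
        backend='nccl',
        init_method='tcp://54.146.137.72:9001',
        world_size=16,
        rank=rank,
    )

    # A single all_reduce verifies that NCCL can communicate across the nodes.
    t = torch.ones(1).cuda()
    dist.all_reduce(t)
    print('rank {}: all_reduce result = {}'.format(rank, t.item()))

if __name__ == '__main__':
    main()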
Is there something that I’m missing?
Any help is much appreciated.
Thanks,
Jalaj