I’m running into problems training fairseq across 2 machines. The script worked in one of our cloud environments but not in another, and I’m trying to figure out why. The driver versions are not identical across the machines, and we don’t have permission to change that in the second environment.
The following code:
Code sample
NUM_NODES=2
NODE_RANK=0
MASTER_IP=192.168.0.34
MASTER_PORT=1234
DATA_DIR=~/wikitext_103
# Change the variables above for each node
TOTAL_UPDATES=125000 # Total number of training steps
WARMUP_UPDATES=10000 # Warmup the learning rate over this many updates
PEAK_LR=0.0005 # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512 # Max sequence length
MAX_POSITIONS=512 # Num. positional embeddings (usually same as above)
MAX_SENTENCES=4 # Number of sequences per batch (batch size)
UPDATE_FREQ=24 # Gradient accumulation steps (multiplies the effective batch size by 24)
python3 -m torch.distributed.launch --nproc_per_node=1 \
--nnodes=$NUM_NODES --node_rank=$NODE_RANK --master_addr=$MASTER_IP \
--master_port=$MASTER_PORT \
$(which fairseq-train) --fp16 $DATA_DIR \
--task masked_lm --criterion masked_lm \
--arch roberta_large --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
--optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
--dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
--max-update $TOTAL_UPDATES --log-format simple --log-interval 1
yields the following error:
-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/tmp/src/fairseq/fairseq_cli/train.py", line 281, in distributed_main
main(args, init_distributed=True)
File "/tmp/src/fairseq/fairseq_cli/train.py", line 46, in main
args.distributed_rank = distributed_utils.distributed_init(args)
File "/tmp/src/fairseq/fairseq/distributed_utils.py", line 100, in distributed_init
dist.all_reduce(torch.zeros(1).cuda())
File "/usr/local/lib64/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 902, in all_reduce
work = _default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:78, invalid argument, NCCL version 2.4.8
Any tips or hints for where to look would be greatly appreciated!
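To take fairseq out of the picture, here is a minimal sketch of a standalone script (nccl_check.py is just an illustrative name) that performs only the all_reduce that fails inside distributed_utils.distributed_init, launched the same way as the training command above. Running it with NCCL_DEBUG=INFO exported on both machines should also print which transports NCCL tries to set up.
# nccl_check.py -- launch on each node with the same torch.distributed.launch
# arguments as the training command, e.g.:
#   python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 \
#     --node_rank=$NODE_RANK --master_addr=$MASTER_IP --master_port=$MASTER_PORT nccl_check.py
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # supplied by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

# Same call that raises the NCCL "invalid argument" error inside fairseq
t = torch.zeros(1).cuda()
dist.all_reduce(t)
print("rank", dist.get_rank(), "all_reduce ok:", t.item())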
Environment
PyTorch Version: 1.4.0
fairseq Version: 0.9.0
OS: CentOS Linux release 7.6.1810
Python version: 3.6.8
CUDA/cuDNN version: Cuda compilation tools, release 10.2, V10.2.89
GPU models and configuration: V100s across 2 machines
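Since the driver stacks differ between the two clouds, a quick diagnostic sketch to compare what PyTorch actually sees on each node (nvidia-smi additionally reports the driver version itself):
# versions.py -- run on each machine and diff the output
import torch

print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("NCCL:", torch.cuda.nccl.version())
print("GPU:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))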