I’m running into problems training fairseq across 2 machines. The script worked in one of our cloud environments but not in another, and I’m trying to figure out why. The driver versions are not identical across the machines, and we don’t have permission to change that in the second environment.
The following code:
Code sample
NUM_NODES=2
NODE_RANK=0
MASTER_IP=192.168.0.34
MASTER_PORT=1234
DATA_DIR=~/wikitext_103
# Change the variables above for each node
TOTAL_UPDATES=125000 # Total number of training steps
WARMUP_UPDATES=10000 # Warmup the learning rate over this many updates
PEAK_LR=0.0005 # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512 # Max sequence length
MAX_POSITIONS=512 # Num. positional embeddings (usually same as above)
MAX_SENTENCES=4 # Number of sequences per batch (batch size)
UPDATE_FREQ=24 # Gradient accumulation steps (multiplies the effective batch size by 24)
python3 -m torch.distributed.launch --nproc_per_node=1 \
--nnodes=$NUM_NODES --node_rank=$NODE_RANK --master_addr=$MASTER_IP \
--master_port=$MASTER_PORT \
$(which fairseq-train) --fp16 $DATA_DIR \
--task masked_lm --criterion masked_lm \
--arch roberta_large --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
--optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
--dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
--max-update $TOTAL_UPDATES --log-format simple --log-interval 1
yields the following error:
-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/tmp/src/fairseq/fairseq_cli/train.py", line 281, in distributed_main
main(args, init_distributed=True)
File "/tmp/src/fairseq/fairseq_cli/train.py", line 46, in main
args.distributed_rank = distributed_utils.distributed_init(args)
File "/tmp/src/fairseq/fairseq/distributed_utils.py", line 100, in distributed_init
dist.all_reduce(torch.zeros(1).cuda())
File "/usr/local/lib64/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 902, in all_reduce
work = _default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:78, invalid argument, NCCL version 2.4.8
Any tips or hints for where to look would be greatly appreciated!
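To take fairseq out of the picture, here is a minimal sketch of a standalone script (nccl_check.py is just an illustrative name) that performs only the all_reduce that fails inside distributed_utils.distributed_init, launched the same way as the training command above. Running it with NCCL_DEBUG=INFO exported on both machines should also print which transports NCCL tries to set up.
# nccl_check.py -- launch on each node with the same torch.distributed.launch
# arguments as the training command, e.g.:
#   python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 \
#     --node_rank=$NODE_RANK --master_addr=$MASTER_IP --master_port=$MASTER_PORT nccl_check.py
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # supplied by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

# Same call that raises the NCCL "invalid argument" error inside fairseq
t = torch.zeros(1).cuda()
dist.all_reduce(t)
print("rank", dist.get_rank(), "all_reduce ok:", t.item())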
Environment
PyTorch Version: 1.4.0
fairseq Version: 0.9.0
OS: CentOS Linux release 7.6.1810
Python version: 3.6.8
CUDA/cuDNN version: Cuda compilation tools, release 10.2, V10.2.89
GPU models and configuration: V100s across 2 machines
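Since the driver stacks differ between the two clouds, a quick diagnostic sketch to compare what PyTorch actually sees on each node (nvidia-smi additionally reports the driver version itself):
# versions.py -- run on each machine and diff the output
import torch

print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("NCCL:", torch.cuda.nccl.version())
print("GPU:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))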