Training freezes before the forward pass when the number of GPUs is increased in distributed training

I made some changes to the model’s forward pass in the VL-BERT repository.
I was able to successfully run my training script over multiple (7) GPUs. However, after some time my code suddenly freezes when I increase the number of GPUs to more than 2. I did not make any changes to the way the model is passed for distributed training.

This is how I increase the number of GPUs used:

CUDA_VISIBLE_DEVICES=1,2,3,4 ./scripts/dist_run_single.sh 4 pretrain/train_end2end.py ./cfgs/contrastive_pretrain/base_prec_random_movienet_images_4x16G_fp32.yaml ./checkpoints_debugcv04
Gets stuck when more than 2 GPUs are used

instead of

CUDA_VISIBLE_DEVICES=1,2 ./scripts/dist_run_single.sh 2 pretrain/train_end2end.py ./cfgs/contrastive_pretrain/base_prec_random_movienet_images_4x16G_fp32.yaml ./checkpoints_debugcv04
Starts training successfully

The model is loaded successfully on each of the local ranks.
The script also enters the train function on each rank: https://github.com/jackroos/VL-BERT/blob/4373674cbf2bcd6c09a2c26abfdb6705b870e3be/common/trainer.py#L56

However, the forward pass doesn’t proceed.
I am using the latest version of PyTorch, 1.7.0.
What might be going wrong here? I assume some synchronization problems might be occurring with more than 2 GPUs.
Thanks

However, after some time my code suddenly freezes when I increase the number of GPUs to more than 2.

At what point does the training get stuck? Do you have any logs output up to the point where training gets stuck (ideally with NCCL_DEBUG=WARN)?
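If it is easier than changing the launch command, one way to enable this is to set the variable from Python at startup; this is an illustrative sketch, assuming it runs before the process group is initialized in train_end2end.py:

import os

# Assumption: this executes before torch.distributed.init_process_group(),
# so NCCL picks up the setting when its communicators are created.
os.environ["NCCL_DEBUG"] = "WARN"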

Also how many GPUs does this host have? Do you run into the same issue on other multi-GPU hosts as well?

I assume some synchronization problems might be occurring with more than 2 GPUs

This usually only becomes a major issue at much larger numbers of GPUs, so it should be able to handle more than 2. Are there any factors that could cause synchronization issues, such as some GPUs being significantly slower than others, or other jobs or processes using those GPUs while you were training?

Hi @osalpekar, thanks for your response

I have figured out that the training runs fine when I remove the metric logger for MLMAccuracy.
It gets stuck when the get() function of MLMAccuracy is called while writing to TensorBoard and the logger. Somehow the all_reduce on sum_metric for this particular metric never completes when more GPUs are used.
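To make the failure mode concrete, here is a simplified sketch (hypothetical code, not the actual VL-BERT metric class) of how an all_reduce inside a metric’s get() can hang when not every rank reaches it:

import torch
import torch.distributed as dist

class DistAccuracy:
    # Illustrative stand-in for an MLMAccuracy-style metric.
    def __init__(self, device):
        self.sum_metric = torch.zeros(1, device=device)
        self.num_inst = torch.zeros(1, device=device)

    def update(self, num_correct, num_total):
        self.sum_metric += num_correct
        self.num_inst += num_total

    def get(self):
        # Collective calls: every rank in the process group must reach these,
        # otherwise the ranks that did call them block here forever.
        dist.all_reduce(self.sum_metric, op=dist.ReduceOp.SUM)
        dist.all_reduce(self.num_inst, op=dist.ReduceOp.SUM)
        return (self.sum_metric / self.num_inst).item()

# If the logging code guards the call with a rank-dependent condition, e.g.
#     if metric.num_inst > 0:
#         writer.add_scalar("mlm_acc", metric.get(), step)
# then ranks where the condition is False never join the all_reduce and the
# remaining ranks hang, which would match a freeze that only appears at logging time.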

In case it helps to see the log files as you suggested:
I have posted the output with NCCL_DEBUG=WARN in two files, covering the two cases (freezes and works_fine): https://gist.github.com/amogh112/84a27280e69b983ea88497892e3855cb
The comment at the end shows the output with NCCL_DEBUG=INFO for the different cases with 1, 2, and 6 GPUs.

Can you verify that this allreduce gets called on all ranks? One possibility could be that some ranks don’t invoke this allreduce, which would result in a freeze.
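As a quick way to check this (a debugging sketch, not code from the repository), you could print the rank right before that allreduce and count the lines printed per logging step:

import torch.distributed as dist

def log_collective_entry(tag):
    # Hypothetical debug helper: call it immediately before the all_reduce on
    # sum_metric inside MLMAccuracy.get(). With N GPUs you should see exactly
    # N lines per logging step; fewer means some ranks never reached the
    # collective, which would explain the hang.
    print(f"[rank {dist.get_rank()}/{dist.get_world_size()}] about to call all_reduce: {tag}",
          flush=True)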