I made some changes to the model’s forward pass in the VL-BERT repository.
I was able to run my training script successfully over multiple (7) GPUs before. However, my code now suddenly freezes when I increase the number of GPUs to more than 2. I did not make any changes to the way the model is passed for distributed training.
This is how I increase the number of GPUs used:
CUDA_VISIBLE_DEVICES=1,2,3,4 ./scripts/dist_run_single.sh 4 pretrain/train_end2end.py ./cfgs/contrastive_pretrain/base_prec_random_movienet_images_4x16G_fp32.yaml ./checkpoints_debugcv04
(gets stuck if more than 2 GPUs are used)
instead of
CUDA_VISIBLE_DEVICES=1,2 ./scripts/dist_run_single.sh 2 pretrain/train_end2end.py ./cfgs/contrastive_pretrain/base_prec_random_movienet_images_4x16G_fp32.yaml ./checkpoints_debugcv04
(starts training successfully)
However, the forward pass doesn’t proceed.
I am using the latest PyTorch, version 1.7.0.
What might be going wrong here? I assume some synchronization problem is occurring with more than 2 GPUs.
Thanks
However after some time, suddenly, my code freezes when I increase the number of GPUs to more than 2 GPUs.
At what point does the training get stuck? Do you have any logs output up to the point where the training gets stuck (ideally with NCCL_DEBUG=WARN)?
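For example, assuming the same launch script as in your post, the NCCL debug environment variable can simply be prefixed to the command:

```shell
# Same invocation as above, with NCCL warning-level debug logging enabled
NCCL_DEBUG=WARN CUDA_VISIBLE_DEVICES=1,2,3,4 ./scripts/dist_run_single.sh 4 \
    pretrain/train_end2end.py \
    ./cfgs/contrastive_pretrain/base_prec_random_movienet_images_4x16G_fp32.yaml \
    ./checkpoints_debugcv04
```

NCCL_DEBUG=INFO gives more verbose output (topology, ring/channel setup) if WARN shows nothing before the hang.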
Also how many GPUs does this host have? Do you run into the same issue on other multi-GPU hosts as well?
I assume some synchronization problems might be occurring with more than 2 GPUs
This usually only becomes a major issue at much larger GPU counts, so it should be able to handle more than 2. Are there any factors that could cause synchronization issues, such as some GPUs being significantly slower than others, or other jobs or processes using those GPUs while you were training?
In case it helps, here are the log files with the debug output you suggested:
I have posted the output with NCCL_DEBUG=WARN in 2 files here for the 2 cases (freezes and works_fine): https://gist.github.com/amogh112/84a27280e69b983ea88497892e3855cb
The comment at the end shows the output with NCCL_DEBUG=INFO for the different cases with 1, 2, and 6 GPUs.
Can you verify that this allreduce gets called on all ranks? One possibility is that some ranks don’t invoke the allreduce, which would result in a freeze: the ranks that did call it block forever waiting for the missing participants.
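To illustrate the failure mode (a minimal stdlib sketch, not using torch.distributed itself): if one rank skips a collective call, every other rank blocks inside it. Here a multiprocessing.Barrier stands in for the allreduce, with a timeout so the hang shows up as an error instead of blocking indefinitely.

```python
import multiprocessing as mp
import threading

def worker(rank, barrier, skip_on_rank, results):
    # The barrier stands in for a collective op (e.g. dist.all_reduce):
    # every rank must enter it, or the remaining ranks block.
    if rank == skip_on_rank:
        results[rank] = "skipped collective"
        return
    try:
        # A real allreduce would block here forever; the timeout makes
        # the hang observable in this sketch.
        barrier.wait(timeout=2)
        results[rank] = "ok"
    except threading.BrokenBarrierError:
        results[rank] = "hang (timed out waiting for missing rank)"

def run(world_size, skip_on_rank=None):
    with mp.Manager() as manager:
        results = manager.dict()
        barrier = mp.Barrier(world_size)
        procs = [
            mp.Process(target=worker, args=(r, barrier, skip_on_rank, results))
            for r in range(world_size)
        ]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return dict(results)
```

With `run(4)` every rank reaches the barrier and reports "ok"; with `run(4, skip_on_rank=2)` the other three ranks stall, which is exactly what a conditional branch around an allreduce on some ranks would do to DDP.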