I am using PyTorch Lightning as my training framework and have tried training on 1, 2, and 4 GPUs (all T4s). My model, a video action classification network, hangs at the same spot each time. It only hangs when I set the trainer flags:
Trainer(
    gpus=2,               # hangs with any value greater than 1
    sync_batchnorm=True,
    accelerator="ddp",
)
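
For context, here is a minimal sketch of the kind of setup I am running (the model and dummy data below are stand-ins for my actual video network and dataset, just to show the shape of the training code):

import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class LitVideoClassifier(pl.LightningModule):
    # Stand-in for my real action classification network; the BatchNorm3d
    # layer is what sync_batchnorm=True converts to SyncBatchNorm.
    def __init__(self, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.BatchNorm3d(16),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(16, num_classes),
        )

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

if __name__ == "__main__":
    # Dummy clips shaped (batch, channels, frames, height, width)
    data = TensorDataset(torch.randn(64, 3, 8, 32, 32),
                         torch.randint(0, 10, (64,)))
    trainer = pl.Trainer(
        gpus=2,               # hangs with any value > 1
        sync_batchnorm=True,  # runs fine with False
        accelerator="ddp",
        max_epochs=1,
    )
    trainer.fit(LitVideoClassifier(), DataLoader(data, batch_size=8))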
I noticed that when it hangs, GPU utilization stays pinned at 100% with no power fluctuations.
I am able to train my model with sync_batchnorm=False.
Does anyone have experience with this, or tips on a possible solution or how to properly debug it?
I have also tested this on 2 V100s; it hangs at a slightly different spot, but it is the same issue.
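
In the meantime, this is the kind of instrumentation I can add before trainer.fit() to capture more information when it hangs (a sketch using the standard-library faulthandler; NCCL_DEBUG is NCCL's own logging variable):

import faulthandler
import os
import signal

# Ask NCCL for verbose logs; must be set before torch.distributed
# initializes the process group.
os.environ.setdefault("NCCL_DEBUG", "INFO")

# Dump every thread's stack to stderr when a process receives SIGUSR1,
# so I can see where each DDP rank is stuck: kill -USR1 <pid>
faulthandler.register(signal.SIGUSR1, all_threads=True)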
Version/OS:
Ubuntu 18.04 LTS
CUDA 11.1
Driver Version 455.45.01
PyTorch 1.7.1
PyTorch Lightning 1.1.5