I am using pytorch-lightning as my training framework, and I have tried training on 1, 2, and 4 GPUs (all T4). My model, a video action classification network, hangs at the same spot every time. It only hangs when I set the trainer flags:
```python
Trainer(
    gpus=(something greater than 1),
    sync_batchnorm=True,
    accelerator="ddp",
)
```
I noticed that when it hangs, GPU utilization stays pinned at 100% with no power fluctuations.
I am able to train my model with sync_batchnorm=False.
Does anyone have experience with this, or tips on what a solution might be or how to properly debug it?
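For reference, one debugging approach I am considering is enabling NCCL's debug logging before training starts, so the logs show which collective operation each rank is stuck in when the hang happens. This is only a sketch; the `NCCL_DEBUG_SUBSYS` filter value is an assumption about what will be most useful here.

```python
import os

# Must be set before any CUDA/NCCL initialization (i.e. before the
# Trainer is constructed), otherwise NCCL ignores these variables.
os.environ["NCCL_DEBUG"] = "INFO"          # print NCCL's per-rank activity
os.environ["NCCL_DEBUG_SUBSYS"] = "COLL"   # restrict logs to collective ops

# ...then build the Trainer as usual; when the hang occurs, the last
# logged collective on each rank indicates where the ranks diverged.
```

With `sync_batchnorm=True`, every batch-norm layer performs an extra all-reduce per forward pass, so a rank that takes a different code path (or a different number of batches) will leave the others waiting forever, which matches the pinned 100% utilization.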
I have also tested this on 2 V100s; it hangs at a slightly different spot, but it is the same issue.
Driver version: 455.45.01