I am using PyTorch Lightning as my training framework and have tried training on 1, 2, and 4 GPUs (all T4s). My model, a video action classification network, hangs at the same spot each time. It only hangs when I set the trainer flags:
Trainer(
    gpus=2,               # hangs with any value greater than 1
    sync_batchnorm=True,
    accelerator="ddp",
)
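
For context, here is a minimal sketch of the kind of setup I am running (the model and dummy data below are stand-ins for my actual video network and dataset, just to show the shape of the training code):

import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class LitVideoClassifier(pl.LightningModule):
    # Stand-in for my real action classification network; the BatchNorm3d
    # layer is what sync_batchnorm=True converts to SyncBatchNorm.
    def __init__(self, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.BatchNorm3d(16),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(16, num_classes),
        )

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

if __name__ == "__main__":
    # Dummy clips shaped (batch, channels, frames, height, width)
    data = TensorDataset(torch.randn(64, 3, 8, 32, 32),
                         torch.randint(0, 10, (64,)))
    trainer = pl.Trainer(
        gpus=2,               # hangs with any value > 1
        sync_batchnorm=True,  # runs fine with False
        accelerator="ddp",
        max_epochs=1,
    )
    trainer.fit(LitVideoClassifier(), DataLoader(data, batch_size=8))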
I noticed that when it hangs, GPU utilization stays pinned at 100% with no power fluctuations.
I am able to train my model with sync_batchnorm=False.
Does anyone have experience with this, or tips on a possible solution or how to properly debug it?
I have also tested this on 2 V100s; it hangs at a slightly different spot, but it is the same issue.
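
In the meantime, this is the kind of instrumentation I can add before trainer.fit() to capture more information when it hangs (a sketch using the standard-library faulthandler; NCCL_DEBUG is NCCL's own logging variable):

import faulthandler
import os
import signal

# Ask NCCL for verbose logs; must be set before torch.distributed
# initializes the process group.
os.environ.setdefault("NCCL_DEBUG", "INFO")

# Dump every thread's stack to stderr when a process receives SIGUSR1,
# so I can see where each DDP rank is stuck: kill -USR1 <pid>
faulthandler.register(signal.SIGUSR1, all_threads=True)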
Version/OS:
Ubuntu 18.04 LTS
CUDA 11.1
Driver Version 455.45.01
PyTorch 1.7.1
PyTorch Lightning 1.1.5