Validation hangs when using DDP and SyncBatchNorm

sunshichen · December 2, 2020, 7:04am

I’m using DDP (one process per GPU) to train a 3D UNet. I converted all the BatchNorm layers in the network to SyncBatchNorm with nn.SyncBatchNorm.convert_sync_batchnorm.

When I run validation on rank 0 at the end of every training epoch, it always freezes at the same validation step. I think the SyncBatchNorm layers are the cause. What is the correct way to validate a DDP model that contains SyncBatchNorm layers? Should I run validation on all ranks?

Code

for epoch in range(epochs):
    model.train()
    train_loader.sampler.set_epoch(epoch)
    for step, (data, target) in enumerate(train_loader):
        # ...training codes
        train(model)
    if dist.get_rank() == 0:
        # ...validation codes
        model.eval()
        validate(model)
    dist.barrier()

Version / os

torch = 1.1.0
ubuntu 18.04
distributed backend: nccl

Freeze during validation with distributed training and model with batch normalization layers

opened 09:53PM - 16 May 19 UTC

closed 12:02AM - 21 May 19 UTC

SweetVlad

oncall: distributed triaged

## ❓ Questions and Help

Hi, I got unexpected behavior during training with torch.distributed.DistributedDataParallel model on multiple GPUs.

- I train my model with DistributedSampler and DataLoader from torch lib. During training everything is fine, but when I validate my model after the first epoch of training, everything gets stuck - no crash, but GPUs are showing 100 % utilization as well as CPUs, but nothing happens.
- If I remove BatchNorm2d layers from the model, everything runs just fine.
- If I use model with BatchNorm2d layers with world size 1 (only one GPU but setup through DistributedDataParallel), everything is also fine.

I assume that different batchnorm values are not correctly synced across gpus or any idea what am I doing wrong? Thanks.

## version / os

dist backend -"nccl", Pytorch 1.1, Cuda 10, ubuntu 16.04

ptrblck · December 4, 2020, 6:25am

Could you update to the latest stable release or the nightly binary and check whether you still see the hang? 1.1.0 is quite old by now, and this issue might already have been fixed.

pritamdamania87 · December 4, 2020, 10:41pm

Yes, you probably need to do validation on all ranks since SyncBatchNorm has collectives which are expected to run on all ranks. The validation is probably getting stuck since SyncBatchNorm on rank 0 is waiting for collectives from other ranks.
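A minimal sketch of running validation on every rank follows. It assumes each rank iterates over its own shard of the validation set (for example through a DistributedSampler) and that every rank sees the same number of batches. The names validate_on_all_ranks, val_loader and loss_fn are hypothetical and do not come from the original post.

```python
import torch
import torch.distributed as dist
import torch.nn as nn


def validate_on_all_ranks(model, val_loader, loss_fn, device):
    """Hypothetical helper: every rank validates, then the totals are summed."""
    model.eval()
    # totals[0] = summed loss, totals[1] = number of samples seen on this rank
    totals = torch.zeros(2, dtype=torch.float64, device=device)
    with torch.no_grad():
        for data, target in val_loader:
            data, target = data.to(device), target.to(device)
            loss = loss_fn(model(data), target)  # assumed to be a per-batch mean
            totals[0] += loss.item() * data.size(0)
            totals[1] += data.size(0)
    # Combine per-rank totals; skipped when no process group is initialized.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(totals, op=dist.ReduceOp.SUM)
    return (totals[0] / totals[1]).item()


# Single-process smoke test on CPU with a toy model and two hand-made batches.
toy_model = nn.Linear(2, 1)
with torch.no_grad():
    toy_model.weight.zero_()
    toy_model.bias.zero_()
toy_batches = [(torch.ones(4, 2), torch.ones(4, 1)) for _ in range(2)]
demo_loss = validate_on_all_ranks(
    toy_model, toy_batches, nn.MSELoss(), torch.device("cpu")
)
```

In the training loop from the question, a call like this would replace the `if dist.get_rank() == 0:` branch so that all ranks enter validation together, with only rank 0 logging the returned value.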

Another option is to convert the SyncBatchNorm layer to a regular BatchNorm layer and then do the validation on a single rank.
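As far as I know there is no built-in inverse of convert_sync_batchnorm, so the conversion has to be written by hand. The sketch below is one possible way to do it; revert_sync_batchnorm is a hypothetical helper, and the target class (BatchNorm3d here, because the model in question is a 3D UNet) is an assumption the caller must get right.

```python
import copy

import torch
import torch.nn as nn


def revert_sync_batchnorm(module, bn_cls=nn.BatchNorm3d):
    """Hypothetical helper: recursively swap SyncBatchNorm layers for bn_cls."""
    converted = module
    if isinstance(module, nn.SyncBatchNorm):
        converted = bn_cls(
            module.num_features,
            eps=module.eps,
            momentum=module.momentum,
            affine=module.affine,
            track_running_stats=module.track_running_stats,
        )
        state = module.state_dict()
        tensors = list(state.values())
        if tensors:
            # Same parameter and buffer names, so the state can be loaded directly.
            converted.load_state_dict(state)
            converted.to(tensors[0].device)
        converted.train(module.training)
    for name, child in module.named_children():
        converted.add_module(name, revert_sync_batchnorm(child, bn_cls))
    return converted


# CPU-only demonstration on a toy network (no process group needed).
net = nn.Sequential(nn.Conv3d(1, 4, 3, padding=1), nn.BatchNorm3d(4), nn.ReLU())
sync_net = nn.SyncBatchNorm.convert_sync_batchnorm(net)
with torch.no_grad():
    sync_net[1].running_mean.fill_(0.5)

# Work on a copy so the model used for training keeps its SyncBatchNorm layers.
plain_net = revert_sync_batchnorm(copy.deepcopy(sync_net))
plain_net.eval()
with torch.no_grad():
    out = plain_net(torch.randn(2, 1, 8, 8, 8))
```

With a DDP-wrapped model, the copy would be taken from the wrapped network (`ddp_model.module`) on rank 0, and the other ranks would wait at the existing `dist.barrier()`.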

sunshichen · December 7, 2020, 6:24am

Thanks, I will try it.
Actually, I have another question about DDP in v1.1.0.
I ran inference with a model that has SyncBatchNorm layers (they actually become BatchNorm layers after loading from a checkpoint). The results differ between the following two approaches:

Only turn on evaluation mode.

model.eval()
# inference...

Manually set track_running_stats of each BN layer to False after model.eval().

model.eval()
set_BN_track_running_stats(model, False)
# do inference..

It is strange that the second approach gives much better results than the first in early epochs. Is this also a version problem?

P.S. After training for more epochs, the results of the two inference methods become similar, though small differences remain.

Below is the code for set_BN_track_running_stats():

import torch

def set_BN_track_running_stats(module, ifTrack=True):
    # Recursively set track_running_stats on every BatchNorm-style layer.
    if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
        module.track_running_stats = ifTrack
    for child in module.children():
        set_BN_track_running_stats(module=child, ifTrack=ifTrack)
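One possible reading of the gap between the two approaches, assuming that disabling track_running_stats on the version in question makes the layer normalize with the current batch's statistics: early in training the running estimates are still close to their initial values (mean 0, variance 1), so they can be far from the real activation statistics. The sketch below illustrates that difference with a standalone BatchNorm3d layer and synthetic input; it calls torch.nn.functional.batch_norm directly instead of relying on the flag, and the tensor shapes and values are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Freshly initialized layer: running_mean = 0, running_var = 1.
bn = nn.BatchNorm3d(4)
# Synthetic activations whose statistics are far from those initial estimates.
x = 5.0 + 3.0 * torch.randn(8, 4, 6, 6, 6)

bn.eval()
with torch.no_grad():
    # Approach 1: plain eval mode, normalizes with the running estimates.
    out_running = bn(x)
    # Approach 2 (as assumed above): normalize with the batch's own statistics.
    out_batch = F.batch_norm(
        x, None, None, bn.weight, bn.bias, training=True, eps=bn.eps
    )

mean_running = out_running.mean().item()
mean_batch = out_batch.mean().item()
std_batch = out_batch.std().item()
```

Here the batch-statistics output is centered and scaled, while the running-statistics output is left almost unchanged. Once the running estimates have converged over more epochs, the two outputs would be expected to move closer together, which matches the P.S. above.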

sunshichen · December 7, 2020, 6:27am

Yes, validating on all GPUs works. But shouldn’t model.eval() automatically disable the sync process?

BTW, thanks for your reply.

pritamdamania87 · December 8, 2020, 1:51am

Are you referring to PyTorch v1.1.0 here? If so, I’d suggest upgrading to PyTorch 1.7 which is the latest version to see if this problem still persists.

pritamdamania87 · December 8, 2020, 1:57am

Good point, I’ve opened an issue for this: "SyncBatchNorm should avoid sync when model.eval() is True" (Issue #48988, pytorch/pytorch on GitHub)

sunshichen · December 8, 2020, 2:54am

Thanks greatly for your reply. I will try v1.7.

sunshichen · December 9, 2020, 6:34am

I upgraded torch to 1.7 today, and the problem is gone. I think it was a 1.1.0 problem. Thanks again for your help.

sunshichen · December 14, 2020, 6:19am

Actually, I ran into another problem after upgrading to v1.7.0: the results are much worse than they were on 1.1. Could you help me with that?

sunshichen · December 14, 2020, 6:20am