I’m training a conv model in both DataParallel (DP) and DistributedDataParallel (DDP) modes. For DDP, I use a single node with one process per GPU.

My model has many BatchNorm2d layers. With all other things the same, I observe that DP trains to better classification accuracy than DDP. Even if I add `SyncBN` (available since PyTorch 1.1), I still observe DP > DDP+SyncBN > DDP without SyncBN in test accuracy.
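For reference, this is roughly how I enable SyncBN — a minimal sketch using `torch.nn.SyncBatchNorm.convert_sync_batchnorm` (the toy model and layer sizes are placeholders, not my actual architecture):

```python
import torch.nn as nn

# Toy conv model with BatchNorm2d layers (placeholder for my real model).
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
)

# Recursively replace every BatchNorm2d with SyncBatchNorm.
# In the real training script this runs in each DDP process,
# before wrapping the model with DistributedDataParallel.
sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

print(type(sync_model[1]).__name__)
```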

I’m aware of the difference between how DP and DDP handle loss/gradient averaging vs. summing: Is average the correct way for the gradient in DistributedDataParallel with multi nodes?
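As I understand that thread, DDP averages gradients across processes, and averaging per-GPU mean losses matches the single-process mean loss as long as every GPU gets an equal-sized shard. A quick plain-Python sanity check of that arithmetic (the loss values are made up):

```python
# Made-up per-sample losses for a global batch of 8.
losses = [0.9, 0.4, 0.7, 0.2, 0.5, 0.8, 0.3, 0.6]

# Single-process (DP-style) mean over the whole batch.
global_mean = sum(losses) / len(losses)

# Split into 4 equal shards (one per GPU), take each shard's mean,
# then average the per-shard means (what DDP's gradient all-reduce does).
shards = [losses[i:i + 2] for i in range(0, len(losses), 2)]
shard_means = [sum(s) / len(s) for s in shards]
ddp_mean = sum(shard_means) / len(shard_means)

print(global_mean, ddp_mean)  # identical up to floating point
```

So with equal shards the two formulations should give the same effective gradient, which is why I don’t think the averaging convention alone explains the gap.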

The LR and total batch size are the same across DP, DDP+SyncBN, and DDP.
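Concretely, this is the batch-size bookkeeping I use to keep the comparison fair (the numbers below are illustrative, not my actual hyperparameters): DP sees the full batch in one process, while each DDP process gets `total_batch // world_size`.

```python
# Illustrative values only, not my actual config.
total_batch = 256
world_size = 4  # number of GPUs, i.e. number of DDP processes

# DP: a single process; the DataLoader batch_size is the full batch
# (DP then scatters it across GPUs internally).
dp_batch_size = total_batch

# DDP: one process per GPU; each process's DataLoader gets an equal shard,
# so the global batch per step is the same as in DP.
ddp_batch_size = total_batch // world_size

print(dp_batch_size, ddp_batch_size, ddp_batch_size * world_size)
```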

If I understand correctly, DP doesn’t do SyncBN either, so DP should in theory achieve the same test accuracy as plain DDP (given the same small per-GPU batch size)? And if we assume a larger *effective* BN batch size leads to better results, I would expect the following test-accuracy ranking:

DDP+SyncBN > DP == DDP

But in practice, I observe: DP > DDP+SyncBN > DDP

Because DDP+SyncBN is 30% faster than DP, I really hope to close this training gap so that I can take advantage of DDP’s superior speed. Thanks for any help!