I’m training a convolutional model in both DataParallel (DP) and DistributedDataParallel (DDP) modes. For DDP, I use a single node with one process per GPU.
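Roughly, the two setups look like this (a simplified sketch, not my actual training script; the tiny model is a stand-in, and in a real run only one of the two wrappers is active):

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DataParallel, DistributedDataParallel

model = nn.Sequential(  # stand-in for my conv model with many BatchNorm2d layers
    nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU(),
)

# DP: a single process replicates the model over all visible GPUs each forward.
dp_model = DataParallel(model.cuda())

# DDP: launched with one process per GPU (e.g. via torch.distributed.launch),
# each process pinned to a single device on the same node.
dist.init_process_group(backend="nccl")  # reads MASTER_ADDR etc. from the env
local_rank = dist.get_rank()             # single node, so rank == GPU index
torch.cuda.set_device(local_rank)
ddp_model = DistributedDataParallel(model.cuda(), device_ids=[local_rank])
```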
My model has many BatchNorm2d layers. With all else equal, I observe that DP trains to higher classification accuracy than DDP. Even when I add
SyncBN from PyTorch 1.1, I still observe DP > DDP+SyncBN > DDP without SyncBN in test accuracy.
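The SyncBN variant just converts the BN layers before wrapping the model in DDP, along these lines (assuming the `model` and `local_rank` from the sketch above):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

# Replace every BatchNorm2d with SyncBatchNorm (available since PyTorch 1.1),
# so BN statistics are computed over the global batch instead of per GPU.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
ddp_sync_model = DistributedDataParallel(model.cuda(), device_ids=[local_rank])
```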
I’m aware of the difference between how DP and DDP average/sum gradients (see the thread “Is average the correct way for the gradient in DistributedDataParallel with multi nodes?”).
The LR and total batch size are the same across all three setups: DP, DDP+SyncBN, and DDP.
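Concretely, I keep the global batch size fixed by dividing it across the DDP processes; a sketch of the bookkeeping with hypothetical numbers (256 global batch on 8 GPUs, reusing the `model` from the first sketch):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

GLOBAL_BATCH = 256  # hypothetical values, just to illustrate the split
NUM_GPUS = 8
dataset = TensorDataset(torch.randn(10000, 3, 32, 32), torch.randint(0, 10, (10000,)))

# DP: one loader with the full global batch; DataParallel splits it across GPUs,
# so each GPU still sees GLOBAL_BATCH / NUM_GPUS samples per forward.
dp_loader = DataLoader(dataset, batch_size=GLOBAL_BATCH, shuffle=True)

# DDP: each of the NUM_GPUS processes loads GLOBAL_BATCH / NUM_GPUS samples,
# with DistributedSampler sharding the dataset so replicas see disjoint data.
sampler = DistributedSampler(dataset)
ddp_loader = DataLoader(dataset, batch_size=GLOBAL_BATCH // NUM_GPUS, sampler=sampler)

# Identical optimizer and LR in all three runs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
```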
If I understand correctly, DP doesn’t sync BN statistics either, so shouldn’t DP in theory achieve the same test accuracy as plain DDP (given the same small per-GPU batch size)? If we assume a larger effective BN batch size leads to better results, I would expect the following test-accuracy ranking:
DDP+SyncBN > DP == DDP
but in practice, I observe:

DP > DDP+SyncBN > DDP
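In case it helps diagnose this, I can gather the BN running stats from every rank and measure how far they drift apart when SyncBN is off (a hypothetical diagnostic; `bn_stat_divergence` is just a name I made up):

```python
import torch
import torch.nn as nn
import torch.distributed as dist

def bn_stat_divergence(ddp_model):
    """Print, per BN layer, how much running_mean varies across DDP ranks."""
    for name, m in ddp_model.module.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            local = m.running_mean.detach().clone()
            gathered = [torch.empty_like(local) for _ in range(dist.get_world_size())]
            dist.all_gather(gathered, local)
            if dist.get_rank() == 0:
                # std over ranks; large values mean each rank has drifted to
                # different statistics because BN was computed per shard only.
                drift = torch.stack(gathered).std(dim=0).max().item()
                print(f"{name}: max running_mean std across ranks = {drift:.4f}")
```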
Because DDP+SyncBN is about 30% faster than DP, I’d really like to close this training gap so I can take advantage of DDP’s superior speed. Thanks for any help!