Equivalence between DP and DDP

I was using DP for training my models and later switched to DDP, but I noticed a significant performance drop after the switch. I've double-checked that the data batches (size, sampling, random seeds, etc.) are consistent between the two scenarios, and I scaled the learning rate following the "proportional to batch size" guideline from the "Train ImageNet in 1 hour" paper. However, I still see the performance drop with DDP.

Is this expected? My understanding is that if the model sees the same data with the same learning rate (and of course starts from the same initialization), DP and DDP training should produce the same model. Am I missing anything? Are there other factors that will likely lead to differences, say, the loss function or batch norm?
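For context on why the two can diverge: DP computes the loss on the gathered full batch, while DDP computes a per-replica loss on each shard and then averages gradients across replicas. With a mean-reduced loss over equal-sized shards these coincide, but with a sum reduction the averaged DDP gradient is smaller by a factor of the world size. A toy pure-Python sketch of this (1-D linear model, made-up numbers, not the PyTorch API):

```python
# Toy 1-D linear model: prediction = w * x, squared error per sample.
# Shows why DDP-style gradient averaging matches a full-batch 'mean'
# loss but is off by world_size for a 'sum' loss.

def grad_sample(w, x, y):
    # d/dw (w*x - y)^2 = 2*(w*x - y)*x
    return 2.0 * (w * x - y) * x

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Full-batch gradients, as DP would effectively compute them.
full_mean_grad = sum(grad_sample(w, x, y) for x, y in zip(xs, ys)) / len(xs)
full_sum_grad = sum(grad_sample(w, x, y) for x, y in zip(xs, ys))

# Two equal shards, one per "GPU"; DDP averages the shard gradients.
shards = [list(zip(xs[:2], ys[:2])), list(zip(xs[2:], ys[2:]))]
ddp_mean_grad = sum(
    sum(grad_sample(w, x, y) for x, y in shard) / len(shard) for shard in shards
) / len(shards)
ddp_sum_grad = sum(
    sum(grad_sample(w, x, y) for x, y in shard) for shard in shards
) / len(shards)

print(abs(full_mean_grad - ddp_mean_grad) < 1e-12)  # True: mean reduction matches
print(abs(full_sum_grad - 2 * ddp_sum_grad) < 1e-12)  # True: sum is off by world_size (2)
```

So with `reduction='sum'` (or an unequal last batch per replica), the effective learning rate differs between DP and DDP even when everything else is held fixed.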


Could you provide a script that can reproduce this issue?

Are you sure you used the same learning rate, loss function, optimizer, number of epochs, etc.?

Both DP and DDP should be able to produce a model that could also be obtained with PyTorch without using DP or DDP, as long as the gradients are synced at every step. What is the performance without DP or DDP?
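One way to convince yourself of this equivalence is a toy simulation: running SGD on the full batch in one process should land on exactly the same weights as splitting the batch across replicas and averaging (all-reducing) the shard gradients before each step. A minimal pure-Python sketch (1-D linear model with mean squared error, hypothetical data):

```python
# Toy check that DDP-style training (average shard gradients, then step)
# tracks single-process full-batch SGD step for step.

def mean_grad(w, batch):
    # Gradient of the mean squared error of prediction w*x over the batch.
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
lr = 0.01

# Single process, full batch.
w_single = 0.0
for _ in range(10):
    w_single -= lr * mean_grad(w_single, data)

# Two "replicas", each holding half the batch; gradients are averaged
# each step, mimicking DDP's all-reduce.
w_ddp = 0.0
shards = [data[:2], data[2:]]
for _ in range(10):
    synced = sum(mean_grad(w_ddp, s) for s in shards) / len(shards)
    w_ddp -= lr * synced

print(abs(w_single - w_ddp) < 1e-9)  # True: trajectories match
```

If the real DP and DDP runs diverge despite identical data, seeds, and hyperparameters, the culprit is usually something that breaks this decomposition, e.g. the loss reduction or per-replica batch norm statistics.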

What’s the loss function? If the loss function is not commutative (i.e., it does not decompose into a mean of per-sample terms), it may result in a difference.
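To make that concrete: a loss that is a plain mean over samples decomposes cleanly across equal-sized shards, but a loss that depends on whole-batch statistics, such as a batch-wise max, does not; averaging per-shard values is not the same as computing the loss on the full batch. A toy illustration with hypothetical per-sample losses:

```python
# A mean-reduced loss decomposes across equal shards; a batch-max loss does not.
errors = [0.1, 0.9, 0.2, 0.4]  # hypothetical per-sample losses
shards = [errors[:2], errors[2:]]

# Mean: the average of shard means equals the full-batch mean.
full_mean = sum(errors) / len(errors)
shard_mean = sum(sum(s) / len(s) for s in shards) / len(shards)
print(abs(full_mean - shard_mean) < 1e-12)  # True

# Max: the average of shard maxes differs from the full-batch max.
full_max = max(errors)                                  # 0.9
shard_max = sum(max(s) for s in shards) / len(shards)   # (0.9 + 0.4) / 2 = 0.65
print(full_max, shard_max)  # 0.9 0.65
```

Under DP the max would be taken over the whole gathered batch; under DDP each replica sees only its shard, so the gradients (and hence the trained model) can differ.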