Equivalence between DP and DDP

I was using DP for training my models and later switched to DDP, but I noticed a significant performance drop after the switch. I have double-checked that the data batches (size, sampling, random seeds, etc.) are consistent between the two scenarios, and I adjusted the learning rate according to the “proportional to batch size” guideline from the “Train ImageNet in 1 hour” paper. However, I still see the performance drop with DDP.
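
For reference, this is roughly how I scaled the learning rate (the base values below are illustrative, not my actual config):

```python
# Linear scaling rule from the "Train ImageNet in 1 hour" paper:
# scale the LR with the global batch size. All numbers are illustrative.
base_lr = 0.1        # LR that worked at the reference batch size
base_batch = 256     # reference (global) batch size
world_size = 4       # number of DDP processes / GPUs
per_gpu_batch = 64   # per-process batch size under DDP

global_batch = world_size * per_gpu_batch
lr = base_lr * global_batch / base_batch   # = 0.1 here, since 256 / 256 = 1
```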

Is this expected? My understanding is that if the model sees the same data and learning rate (and of course starts from the same initialization), DP and DDP training should produce the same model. Am I missing anything? Are there other factors that could lead to differences, say, the loss function or batch norm?

Thanks!

Could you provide a script that can reproduce this issue?

Are you sure that you used the same learning rate, loss function, optimizer, same number of epochs, etc?

Both DP and DDP should be able to produce the same model that plain PyTorch training (without DP or DDP) would produce, as long as the gradients are synced at every step. What is the performance without DP or DDP?
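
For example, a toy comparison along these lines (CPU with the gloo backend, two processes; all names, sizes, and hyperparameters are made up) is what I mean: with the same initialization, the same data split, and a mean-reduced loss, the weights after one DDP step should match the weights after one plain step on the full batch.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def make_model():
    torch.manual_seed(0)                       # identical init in every process
    return torch.nn.Linear(10, 1)

def make_data():
    torch.manual_seed(1)                       # identical data in every process
    return torch.randn(8, 10), torch.randn(8, 1)

def one_step(model, x, y):
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    torch.nn.functional.mse_loss(model(x), y).backward()
    opt.step()

def ddp_worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    model = DDP(make_model())                  # gradients are all-reduced (averaged) in backward
    x, y = make_data()
    shard = slice(rank * 4, (rank + 1) * 4)    # each rank sees half of the global batch
    one_step(model, x[shard], y[shard])
    if rank == 0:
        print("DDP   :", model.module.weight.data)
    dist.destroy_process_group()

if __name__ == "__main__":
    baseline = make_model()
    x, y = make_data()
    one_step(baseline, x, y)                   # plain training on the full batch
    print("plain :", baseline.weight.data)
    mp.spawn(ddp_worker, args=(2,), nprocs=2)  # rank 0 prints weights that should match
```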

What’s the loss function? If the loss function is not commutative, then it may result in a difference.
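
To illustrate what I mean: DP typically gathers the outputs and computes one loss over the whole batch, while DDP computes the loss per process and averages the gradients, so the two only match when the loss reduces to a per-sample mean. A minimal sketch with toy tensors and two simulated workers (no actual DP/DDP needed to see the effect):

```python
import torch

# A "global batch" of 8 samples, split across 2 workers of 4 samples each.
torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
x = torch.randn(8, 3)
y = torch.randn(8)

def grad_of(loss_fn, inputs, targets):
    # Gradient of the given loss w.r.t. w for one forward/backward pass.
    if w.grad is not None:
        w.grad = None
    loss_fn(inputs @ w, targets).backward()
    return w.grad.clone()

mse_mean = torch.nn.MSELoss(reduction="mean")
mse_sum = torch.nn.MSELoss(reduction="sum")

# DP-style: a single loss over the whole gathered batch.
g_full_mean = grad_of(mse_mean, x, y)
g_full_sum = grad_of(mse_sum, x, y)

# DDP-style: per-worker loss, then average the gradients (what all-reduce does).
g_ddp_mean = (grad_of(mse_mean, x[:4], y[:4]) + grad_of(mse_mean, x[4:], y[4:])) / 2
g_ddp_sum = (grad_of(mse_sum, x[:4], y[:4]) + grad_of(mse_sum, x[4:], y[4:])) / 2

print(torch.allclose(g_full_mean, g_ddp_mean))  # True: a mean loss decomposes cleanly
print(torch.allclose(g_full_sum, g_ddp_sum))    # False: off by the worker count
```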