In one of my recent projects I had to leverage DistributedDataParallel/DataParallel due to memory constraints. But simply wrapping the model with DP/DDP and training it leads to slightly worse performance than the original single-card training (exact same settings).
Would it be possible to close this gap? Or, more specifically, what are the differences under the hood (e.g. BatchNorm should be replaced with SyncBatchNorm, etc.)?
@zxhuang1698 can you clarify your question a bit? I'm not sure I get it. Are you specifically referring to DistributedDataParallel?
In most cases, distributed training requires communication to sync the model's gradients, so it usually does not scale linearly compared to local training.
@wanchaol sorry for the confusion. By performance I do not mean computational efficiency; I am talking about the performance on the task (e.g. loss, accuracy, IoU, etc.).
In the ideal case, I would assume training on multiple cards should produce the same loss curve (for example) as training on a single card. I know randomness exists, but in my setup multi-GPU models (both DP and DDP) consistently perform worse than single-GPU models, even though they share the exact same training configs. So I am wondering whether there are any behaviors of DP/DDP that can cause such a gap.
For example, I remember DP/DDP do not handle batchnorm in a synchronized way by default: each card computes batch statistics locally. This can degrade performance when the batch size is small. Say we have a batch size of 8 split across four cards: the batch statistics are then computed with an effective batch size of 2 and can have high variance, compared to training the same model on a single card with batch size 8.
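For reference, the per-GPU BatchNorm issue described above can be addressed with `torch.nn.SyncBatchNorm.convert_sync_batchnorm`, which recursively swaps every BatchNorm layer in a model before it is wrapped in DDP (note that SyncBatchNorm only takes effect under DDP, not DP). A minimal sketch with a toy model:

```python
import torch
import torch.nn as nn

# Toy model with a regular BatchNorm layer.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)

# Recursively replace every BatchNorm* module with SyncBatchNorm,
# so batch statistics are computed over the global batch across all
# DDP processes instead of per-GPU.
sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

print(type(sync_model[1]).__name__)  # SyncBatchNorm

# In an actual DDP run one would then do (assuming the process group is
# already initialized and `local_rank` is set for this process):
# sync_model = sync_model.to(local_rank)
# ddp_model = nn.parallel.DistributedDataParallel(
#     sync_model, device_ids=[local_rank])
```

The conversion has to happen before wrapping in DDP, since DDP registers hooks on the modules it is given.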
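Another behavior worth checking (an assumption about the setup, not something stated in the thread): DP splits the batch you pass in across cards, while with DDP the DataLoader `batch_size` is per-process, so the effective global batch is `batch_size * world_size`. Keeping the single-GPU config unchanged therefore silently changes the effective batch, and a common heuristic (the linear scaling rule) rescales the learning rate to compensate. Quick arithmetic with hypothetical numbers:

```python
# Hypothetical values for illustration; not taken from the thread.
world_size = 4       # number of GPUs / DDP processes
per_gpu_batch = 8    # DataLoader batch_size passed in EACH DDP process

# DDP averages gradients across processes, so one optimizer step
# effectively sees:
effective_batch = per_gpu_batch * world_size
print(effective_batch)  # 32

# Linear scaling rule: scale the LR by the batch-size increase.
base_lr = 0.1                      # LR tuned for a single-GPU batch of 8
scaled_lr = base_lr * world_size
print(scaled_lr)  # 0.4
```

If the goal is to match the single-card run exactly, the per-GPU batch size should instead be the single-card batch divided by `world_size`, which is what DP does implicitly.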