In our project, which trains a ResNet-50 model with PyTorch and DistributedDataParallel on multiple GPUs, I encountered a problem. Here is the GitHub link for our project.
Comparing the validation accuracy after each epoch between a single GPU and multiple GPUs, it looks like the GPUs don't share their training results with each other, as if only one GPU were actually training the model.
Here is a comparison between a single Nvidia RTX 3090 and 4 Nvidia RTX 3090s using the same training parameters, trained on the ImageNet dataset.
If the epoch axis of the 4x3090 graph is divided by the number of GPUs (green curve), the graph has a shape similar to that of the single GPU (blue curve). So basically, training with 4 GPUs needs 4 epochs to reach the same result a single GPU achieves in only 1 epoch. Since the dataset is distributed across all GPUs, each GPU only sees 1/4 of the whole dataset per epoch, which would explain the worse training result for each GPU after each epoch.
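For context on the 1/4-of-the-dataset point: `DistributedSampler` shards the index list so that each rank only draws `len(dataset) / world_size` samples per epoch. A minimal sketch of that partitioning (`shard_indices` is a hypothetical helper for illustration, not our actual code):

```python
def shard_indices(num_samples, world_size, rank):
    """Simplified version of DistributedSampler's sharding (no shuffle)."""
    # Pad the index list so the total is evenly divisible by world_size,
    # as DistributedSampler does by repeating indices from the start.
    total = ((num_samples + world_size - 1) // world_size) * world_size
    indices = list(range(num_samples))
    indices += indices[: total - num_samples]  # wrap-around padding
    return indices[rank::world_size]           # this rank's strided shard

# With 8 samples and 4 GPUs, each rank sees only 2 samples per epoch:
shards = [shard_indices(8, 4, r) for r in range(4)]
# e.g. rank 0 gets [0, 4], rank 1 gets [1, 5], and so on.
```

So per epoch each process really does run fewer optimizer steps, but with DDP the gradients are averaged across ranks on every step, which is why the per-epoch progress should not simply be 4x slower.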
Playing around with training parameters such as different batch sizes and learning rates gives me the same result. Another comparison with different training parameters follows in the next post, because of the embedded-media limit for new users.
No matter what batch size or learning rate is used, training with a single GPU is always much more efficient than training with multiple GPUs using DistributedDataParallel.
I also ran tests with smaller datasets and different training parameters, with the same effect.
Using DataParallel instead of DistributedDataParallel works as expected: training on multiple GPUs with the batch size divided by the number of GPUs gives results similar to training on a single GPU with the full batch size.
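To make the batch-size bookkeeping behind this comparison explicit (a sketch with hypothetical helper names, not our training code): DataParallel takes one full batch and splits it across GPUs, while with DistributedDataParallel each process loads its own batch, so the effective global batch size is the per-process batch size times the world size.

```python
def dp_per_gpu_batch(global_batch, num_gpus):
    # DataParallel: you pass the full batch to the model,
    # and it is split across the GPUs internally.
    return global_batch // num_gpus

def ddp_effective_batch(per_gpu_batch, world_size):
    # DDP: each process trains on per_gpu_batch samples per step,
    # and gradients are averaged across processes.
    return per_gpu_batch * world_size

# To match a single-GPU run with batch size 256 on 4 GPUs:
# - DataParallel: keep passing batches of 256 (64 land on each GPU)
# - DDP: give each process a batch size of 256 // 4 = 64
```

If the DDP run instead keeps the single-GPU batch size per process, the effective batch size is 4x larger, which usually also requires adjusting the learning rate.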
But since DataParallel is much slower than DistributedDataParallel, I would be very grateful if someone could give me a hint about what I am doing wrong.
Thank you in advance.