Slow convergence of loss for Distributed Data Parallel in multi-GPU environment

Hi,

I’m training my model on a 2-GPU system with CUDA 11.0 and PyTorch 1.7.1. Using DDP, the loss converges faster when I train on a single GPU. When I use both GPUs, convergence is noticeably slower.

What could be the problem? I tried fixing the random seed, but the results are the same.

1-GPU Epochs with DDP:

===> Epoch 0 Complete: Avg. Loss: 0.36975980444256995
===> Epoch 1 Complete: Avg. Loss: 0.32454686222479784
===> Epoch 2 Complete: Avg. Loss: 0.3071180762120964
===> Epoch 3 Complete: Avg. Loss: 0.2750444378917671
===> Epoch 4 Complete: Avg. Loss: 0.24473399923287129
===> Epoch 5 Complete: Avg. Loss: 0.21892599486872508
===> Epoch 6 Complete: Avg. Loss: 0.20032298285795483
===> Epoch 7 Complete: Avg. Loss: 0.1875871649501547
===> Epoch 8 Complete: Avg. Loss: 0.17785485737093265
===> Epoch 9 Complete: Avg. Loss: 0.17077650971643155

2-GPU Epochs with DDP:

===> Epoch 0 Complete: Avg. Loss: 0.4675159709281232
===> Epoch 1 Complete: Avg. Loss: 0.33464150040982715
===> Epoch 2 Complete: Avg. Loss: 0.330990740333695
===> Epoch 3 Complete: Avg. Loss: 0.3278805889997138
===> Epoch 4 Complete: Avg. Loss: 0.3254638583545225
===> Epoch 5 Complete: Avg. Loss: 0.3231941443609904
===> Epoch 6 Complete: Avg. Loss: 0.3184018903468029
===> Epoch 7 Complete: Avg. Loss: 0.3123437018997698
===> Epoch 8 Complete: Avg. Loss: 0.30351564180420104
===> Epoch 9 Complete: Avg. Loss: 0.29410275745104597

Regards,
MJay
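
One thing that may be worth checking for the logs above: DDP averages gradients across processes, so if each process keeps the same batch size as the single-GPU run and the data is split with a DistributedSampler (both are assumptions, the post does not say), every optimizer step covers twice as many samples and each epoch has roughly half as many steps. A rough sketch of that arithmetic follows. The function name, the dataset size of 10,000, the batch size of 32, and the base learning rate are all made up for illustration, and the linear learning-rate scaling is only a common heuristic, not a guaranteed fix.

```python
import math


def ddp_epoch_stats(dataset_size, per_gpu_batch, world_size, base_lr):
    """Estimate per-epoch quantities when every DDP process keeps the same batch size.

    Assumes the data is split evenly across processes (DistributedSampler-style,
    with padding so every rank gets the same count) and drop_last=False in the loader.
    All argument names are hypothetical.
    """
    # Each rank sees an equal share of the dataset, rounded up.
    samples_per_rank = math.ceil(dataset_size / world_size)
    # Optimizer steps taken per epoch on each rank.
    steps_per_epoch = math.ceil(samples_per_rank / per_gpu_batch)
    # Gradients are averaged over all ranks, so one step covers this many samples.
    effective_batch = per_gpu_batch * world_size
    # Linear scaling heuristic: grow the learning rate with the effective batch.
    scaled_lr = base_lr * world_size
    return {
        "steps_per_epoch": steps_per_epoch,
        "effective_batch": effective_batch,
        "scaled_lr": scaled_lr,
    }


if __name__ == "__main__":
    # Hypothetical numbers, only to show how the quantities change with GPU count.
    for gpus in (1, 2):
        print(gpus, ddp_epoch_stats(10_000, 32, gpus, 1e-3))
```

If this is what is happening, comparing the two runs per optimizer step rather than per epoch, or halving the per-GPU batch size so the effective batch matches the single-GPU run, should make the curves more comparable.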

Hi, I’m facing a similar problem, maybe even worse.

For a model trained on the ModelNet40 dataset with the same settings, a single GPU brings the error down to 5e-4 and the accuracy up to 80% in 88 epochs.

However, with 2 GPUs wrapped in DDP, at epoch 88 the error was around 15e-4 and the accuracy 48%; with 4 GPUs, at epoch 88 the error was around 14e-4 and the accuracy around 53%. In both cases, the error and accuracy fail to improve meaningfully with more epochs.

Because the batch size may be one, I didn’t use batchnorm and applied layernorm instead, so no SyncBatchNorm was added. Is this choice of normalization the reason for such a huge discrepancy? I also used torch.cuda.amp throughout.
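
On the normalization question, a small sketch may help: layernorm computes its statistics per sample, so splitting a batch across GPUs should not change its output, whereas batchnorm statistics depend on which samples share a batch (which is what SyncBatchNorm compensates for). The plain-Python functions below are simplified stand-ins without the learnable scale and shift, not the PyTorch modules themselves, and the toy batch is made up.

```python
import math


def layer_norm(sample, eps=1e-5):
    """Normalize one sample over its own features (no affine parameters)."""
    mean = sum(sample) / len(sample)
    var = sum((x - mean) ** 2 for x in sample) / len(sample)
    return [(x - mean) / math.sqrt(var + eps) for x in sample]


def batch_norm(batch, eps=1e-5):
    """Normalize each feature over the batch (training-mode statistics, no affine)."""
    n = len(batch)
    columns = list(zip(*batch))
    means = [sum(col) / n for col in columns]
    variances = [sum((x - m) ** 2 for x in col) / n for col, m in zip(columns, means)]
    return [
        [(x - m) / math.sqrt(v + eps) for x, m, v in zip(row, means, variances)]
        for row in batch
    ]


# Toy batch of 4 samples with 3 features each (hypothetical values).
batch = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0], [0.0, 5.0, 1.0], [2.0, 2.0, 9.0]]

# "1 GPU": the whole batch at once. "2 GPUs": two halves processed independently.
full_ln = [layer_norm(s) for s in batch]
split_ln = [layer_norm(s) for s in batch[:2]] + [layer_norm(s) for s in batch[2:]]
full_bn = batch_norm(batch)
split_bn = batch_norm(batch[:2]) + batch_norm(batch[2:])

print("layernorm unchanged by split:", full_ln == split_ln)
print("batchnorm unchanged by split:", full_bn == split_bn)
```

So, under these simplifications, layernorm by itself is unlikely to explain a gap between single-GPU and DDP runs; the effective batch size and learning rate may be a more promising place to look.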

I hope we can discuss this.

The platform is PyTorch 1.7.1 with CUDA 10.1.

Did you ever resolve this problem? I am having a similar problem.