Hi, I noticed that when I train on the same dataset with DDP on 8 GPUs versus a single GPU, the loss plots are very different (the DDP loss is higher), and it seems to take more epochs for the DDP loss to decrease to the single-GPU level. My questions are:
Is the single-GPU loss directly comparable to the DDP loss? If they are different quantities, does that mean that if we train with 2, 4, and 16 GPUs and get three losses, we cannot compare them?
If we train both the DDP setup and the single-GPU setup for 10 epochs, can we really say the two resulting models are almost the same? Or is the single-GPU model better because it reaches a lower loss?
Since you didn’t mention it in your post, how are you adjusting the batch size and learning rate when scaling to more GPUs? Note that you would want to increase the learning rate when the number of GPUs increases, even if the per-GPU batch size is the same in both setups, because the effective global batch size is the DataLoader batch size times the number of GPUs.
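For concreteness, here is a minimal sketch of that linear scaling rule; the base learning rate, batch size, and model below are placeholders rather than anything from your setup:

```python
import torch
import torch.distributed as dist

# Placeholder values -- substitute your own single-GPU hyperparameters.
base_lr = 0.1            # learning rate tuned for the single-GPU run
per_gpu_batch_size = 64  # DataLoader batch size on each rank

world_size = dist.get_world_size() if dist.is_initialized() else 1
global_batch_size = per_gpu_batch_size * world_size  # effective batch per optimizer step

# Linear scaling rule: grow the learning rate in proportion to the global batch size.
scaled_lr = base_lr * world_size

model = torch.nn.Linear(128, 10)  # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr, momentum=0.9)
```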
As for your second point, it is well known that sample efficiency shows diminishing returns as the batch size is scaled up, so the number of required epochs may not be exactly the same between setups with different numbers of GPUs. Naively, the model with the lower loss would be the better one, and you can consult the literature, since model quality vs. batch size and sample efficiency vs. batch size are heavily studied topics.
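One more thing to rule out before comparing curves: with DDP, each rank typically logs the loss on its own shard of the global batch, so the plotted curve can be a noisier per-rank quantity rather than the global average that the single-GPU run reports. A small sketch of how you could log a directly comparable number (the helper name is just mine):

```python
import torch
import torch.distributed as dist

def global_mean_loss(local_loss: torch.Tensor) -> torch.Tensor:
    """Average the per-rank loss over all DDP processes so the logged value
    matches what a single-GPU run over the full global batch would report."""
    if not (dist.is_available() and dist.is_initialized()):
        return local_loss  # single-process run: nothing to reduce
    loss = local_loss.detach().clone()
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)
    return loss / dist.get_world_size()
```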
I have adjusted the batch size so that the multi-GPU global batch size equals the single-GPU batch size. In this case, do I still need to adjust the learning rate to get the same performance from the two models? My first intuition is that their loss plots should be the same, but they are different; the multi-GPU run seems to converge more slowly.
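Roughly what I am doing, simplified and with placeholder numbers, since the DataLoader batch size under DDP is per process:

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

single_gpu_batch_size = 256  # placeholder for my original single-GPU batch size
world_size = dist.get_world_size() if dist.is_initialized() else 1

# The DataLoader batch_size is per process under DDP, so divide it by the
# number of ranks to keep the global batch size equal to the single-GPU one.
per_gpu_batch_size = single_gpu_batch_size // world_size

dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))
sampler = DistributedSampler(dataset) if dist.is_initialized() else None
loader = DataLoader(dataset, batch_size=per_gpu_batch_size,
                    sampler=sampler, shuffle=(sampler is None))
```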
That is why I asked the second question: even though single-GPU and multi-GPU training use the same global batch size, their losses follow different trends. Can we compare them and say the lower loss is better, or can we simply not compare them because they have different training properties?
Perhaps I should not be chiming in on this since I’m relatively new to the whole multi-GPU training paradigm, but:
I have adjusted the batch size so that the multi-GPU global batch size equals the single-GPU batch size.
Hopefully someone will correct me if I’m wrong here, but isn’t the reason behind the speedups of multi-GPU training purely that it enables much larger global batch sizes, allowing us to crunch through large datasets in fewer optimization steps?
Unless, of course, you set the multi-GPU global batch size equal to the single-GPU batch size only to conduct tests on the loss trend, in which case, ignore me.
In theory, slower convergence would be unexpected at the same global batch size. Are you using normalization layers in your model? Since you are keeping the global batch size the same, I’m wondering whether the smaller per-GPU batch size could be interfering with normalization layers such as BatchNorm, whose statistics are computed per GPU by default.
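If the model does use BatchNorm, one thing worth trying (just a suggestion on my part, since you haven’t said which normalization you use) is converting it to SyncBatchNorm so that the statistics are computed over the whole global batch instead of each GPU’s slice:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Stand-in model containing BatchNorm; replace with your actual network.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

# Convert every BatchNorm layer to SyncBatchNorm so the normalization
# statistics are synchronized across ranks instead of computed per GPU.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# Assumes the process group has already been initialized and a CUDA device
# has been set for this rank before wrapping the model in DDP.
model = model.cuda()
model = DDP(model, device_ids=[torch.cuda.current_device()])
```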