GPU usage decreases as the number of training GPUs increases

Hi community.

I have profiled different models on single-GPU and multi-GPU setups, and I found that as I increase the number of GPUs the model trains on, the usage of each GPU decreases. I would like to know whether this is normal, or whether there is a way to increase GPU usage in the multi-GPU setup.

The profiling covered different model complexities, different numbers of DataLoader workers, and different numbers of GPUs (with DataParallel):

  • Models: EfficientDet-d0, d2, d4 (input resolutions of 512, 768, and 1024, to give an idea)
  • #Workers: Varying from 1 to batch_size * 3
  • #GPUs: 1, 2, or 4 RTX 2080 Ti

These were my findings:

| Model | GPUs (RTX 2080 Ti) | GPU usage [%] | Workers | Batch size | Batch/s | Image/s |
|-------|--------------------|---------------|---------|------------|---------|---------|
| d0    | x1                 | 90-93         | 4       | 8          | 4.48    | 35.8    |
| d2    | x1                 | 91-92         | 1       | 3          | 3.02    | 9.06    |
| d4    | x1                 | 93-96         | 1       | 1          | 1.85    | 1.85    |
| d0    | x2                 | 70-89         | 24      | 16         | 3.35    | 53.6    |
| d2    | x2                 | 72-83         | 6       | 6          | 2.33    | 13.98   |
| d4    | x2                 | 80-92         | 1       | 2          | 1.50    | 3       |
| d0    | x4                 | 45-62         | 8       | 32         | 2.28    | 72.96   |
| d2    | x4                 | 58-70         | 6       | 12         | 1.78    | 21.36   |
| d4    | x4                 | 65-82         | 2       | 4          | 1.16    | 4.64    |

The conclusions are:

  • #Workers: Increasing the number of workers increases GPU usage for the simple model, EfficientDet-d0, but with EfficientDet-d2 and d4 the number of workers had little to no influence.
  • Model complexity: The more complex the model, the higher the GPU usage.
  • #GPUs: Increasing the number of GPUs available for training decreases the usage of each individual GPU.

While the first two points make sense to me, I find the third hard to understand. Why does increasing the number of GPUs decrease the usage? I know that with DataParallel the results of each split mini-batch need to be synchronized to compute the loss, so some reduction in GPU usage makes sense, as this aggregation might be performed on just one GPU or on the CPU. However, I expected the reduction to be constant (a bit less GPU usage because of DataParallel), whereas it turns out to grow with the number of GPUs used: per-GPU usage looks roughly inversely proportional to the GPU count.
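To put a number on the third point, here is a small sketch that derives scaling efficiency (measured throughput divided by ideal linear throughput) from the Image/s column of the table above. The function name `scaling_efficiency` is made up for illustration, and the comparison is only approximate because the worker counts differ between rows.

```python
# Image/s values copied from the table above, keyed by model and GPU count.
images_per_second = {
    "d0": {1: 35.8, 2: 53.6, 4: 72.96},
    "d2": {1: 9.06, 2: 13.98, 4: 21.36},
    "d4": {1: 1.85, 2: 3.0, 4: 4.64},
}


def scaling_efficiency(throughput_by_gpus):
    """Return measured throughput as a fraction of ideal linear scaling.

    Ideal throughput on n GPUs is n times the single-GPU throughput.
    """
    base = throughput_by_gpus[1]
    return {n: t / (n * base) for n, t in throughput_by_gpus.items()}


for model, throughput in images_per_second.items():
    efficiency = scaling_efficiency(throughput)
    # Format each fraction as a whole-number percentage for readability.
    print(model, {n: f"{e:.0%}" for n, e in efficiency.items()})
```

For d0 this gives roughly 75% efficiency on 2 GPUs and 51% on 4 GPUs, so total throughput still rises with more GPUs, but each GPU does less work than it did alone.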

Has anyone experienced this issue before? Can anyone explain this behavior?



It looks like you’re not quite getting linear scaling with the number of GPUs. One idea is to try DistributedDataParallel, which is generally faster than DataParallel because it runs one process per GPU instead of driving all GPUs from a single process.
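In case it helps, here is a minimal sketch of what a DistributedDataParallel training loop can look like. It is not your EfficientDet setup: the `nn.Linear` model, the random tensors, the port number, and the `train`/`launch` function names are all placeholders I made up for illustration. It falls back to CPU with the `gloo` backend when there are not enough GPUs, so it can be tried on any machine.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def train(rank, world_size, epochs=2):
    """Run a toy training loop in one process; `rank` identifies the process."""
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")  # assumed to be a free port
    use_cuda = torch.cuda.is_available() and torch.cuda.device_count() >= world_size
    dist.init_process_group("nccl" if use_cuda else "gloo", rank=rank, world_size=world_size)
    device = torch.device(f"cuda:{rank}" if use_cuda else "cpu")
    if use_cuda:
        torch.cuda.set_device(device)

    # Placeholder model and random data; swap in the real model and dataset.
    model = nn.Linear(16, 4).to(device)
    ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)
    dataset = TensorDataset(torch.randn(64, 16), torch.randn(64, 4))
    # The sampler gives each process a distinct shard of the dataset.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)  # batch size is per process

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    last_loss = float("nan")
    try:
        for epoch in range(epochs):
            sampler.set_epoch(epoch)  # reshuffle differently each epoch
            for inputs, targets in loader:
                inputs, targets = inputs.to(device), targets.to(device)
                optimizer.zero_grad()
                loss = loss_fn(ddp_model(inputs), targets)
                loss.backward()  # gradients are averaged across processes here
                optimizer.step()
                last_loss = loss.item()
    finally:
        dist.destroy_process_group()
    return last_loss


def launch(world_size):
    """Start one training process per GPU (or per CPU worker in the fallback)."""
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```

Calling `launch(4)` from under an `if __name__ == "__main__":` guard would start four processes. Note that the DataLoader batch size is per process here, so to match your 4-GPU DataParallel run with a total batch of 32 you would use 8 per process.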

I would also profile your code with torch.utils.bottleneck to see where the time is going and whether any optimizations help.

Finally, I would try this multi-epoch dataloader if you observe slowness at the start of each epoch (which would reduce the average images/second in the multi-GPU case):


Thank you. I am going to review the code with the help of torch.utils.bottleneck and see if I can get linear scaling.