Hi community.
I am writing because I have profiled different models on single-GPU and multi-GPU setups, and I found that as I increase the number of GPUs the model trains on, the utilization of each individual GPU decreases. I would like to know whether this is normal, or whether there is a way to increase per-GPU usage in a multi-GPU setup.
The profiling covered different model complexities, different numbers of DataLoader workers, and different numbers of GPUs (with DataParallel):
- Models: EfficientDet-d0, d2, d4 (input resolutions of 512, 768, 1024, to give an idea)
- #Workers: varying from 1 to batch_size * 3
- #GPUs: 1, 2, and 4 RTX 2080 Ti
These were my findings:
Model | GPU (x1) | GPU usage [%] | Workers | Batch size | Batches/s | Images/s |
---|---|---|---|---|---|---|
d0 | RTX 2080 Ti | 90-93 | 4 | 8 | 4.48 | 35.8 |
d2 | RTX 2080 Ti | 91-92 | 1 | 3 | 3.02 | 9.06 |
d4 | RTX 2080 Ti | 93-96 | 1 | 1 | 1.85 | 1.85 |
Model | GPU (x2) | GPU usage [%] | Workers | Batch size | Batches/s | Images/s |
---|---|---|---|---|---|---|
d0 | RTX 2080 Ti | 70-89 | 24 | 16 | 3.35 | 53.6 |
d2 | RTX 2080 Ti | 72-83 | 6 | 6 | 2.33 | 13.98 |
d4 | RTX 2080 Ti | 80-92 | 1 | 2 | 1.50 | 3 |
Model | GPU (x4) | GPU usage [%] | Workers | Batch size | Batches/s | Images/s |
---|---|---|---|---|---|---|
d0 | RTX 2080 Ti | 45-62 | 8 | 32 | 2.28 | 72.96 |
d2 | RTX 2080 Ti | 58-70 | 6 | 12 | 1.78 | 21.36 |
d4 | RTX 2080 Ti | 65-82 | 2 | 4 | 1.16 | 4.64 |
The conclusions are:
- #Workers: increasing the number of workers translates into higher GPU usage for the simple model EfficientDet-d0, but for EfficientDet-d2/d4 the number of workers had little to no influence
- Model complexity: the more complex the model, the higher its GPU usage
- #GPUs: increasing the number of GPUs available to a model translates into lower GPU usage on each of those GPUs
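To put the third point in numbers, here is the scaling efficiency (measured speedup over 1 GPU divided by the ideal linear speedup) computed from the images/s columns of the tables above; this is plain arithmetic on the measured figures, nothing else assumed:

```python
# Images/s from the tables above, keyed by model and number of GPUs.
throughput = {
    "d0": {1: 35.8, 2: 53.6, 4: 72.96},
    "d2": {1: 9.06, 2: 13.98, 4: 21.36},
    "d4": {1: 1.85, 2: 3.0, 4: 4.64},
}

for model, t in throughput.items():
    for n in (2, 4):
        # Efficiency = actual speedup over 1 GPU / ideal speedup (n).
        eff = t[n] / t[1] / n
        print(f"{model}: {n} GPUs -> {eff:.0%} of linear scaling")
```

Every model loses efficiency as GPUs are added (e.g. d0 drops from ~75% of linear at 2 GPUs to ~51% at 4), which matches the falling per-GPU utilization.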
While the first two points make sense to me, I find the third hard to understand. Why does increasing the number of GPUs decrease per-GPU usage? I know that with DataParallel each split mini-batch needs to be synchronized to compute the loss, so some reduction in GPU usage makes sense, since this aggregation might be performed on just one GPU or on the CPU. However, I expected this reduction to be roughly constant (a bit less GPU usage because of DataParallel), but it turns out to be inversely proportional to the number of GPUs used…
Has anyone experienced this issue before? Can anyone explain this behavior?
Thanks!