I am training a large network such as a ResNet with a very small batch size, say 25. When I do that, I get very low and oscillating GPU utilization. I have seen several posts regarding low GPU utilization in PyTorch. However, they suggest one of the following:
“Increase the batch size.”: But the small batch size is a deliberate modeling choice, not a computational one; I want it to stay small.
“Increase the number of workers, as data loading might be the bottleneck.”: First, data loading is not the bottleneck, since it takes much less time than the computation (see the timing sketch after this list). Second, increasing the number of workers actually increases the running time of my code. Third, the low and oscillating GPU utilization persists even after increasing the number of workers. Hence, this suggestion does not apply either.
“Set shuffle=False.”: Again, not a feasible solution, as I have to shuffle my data somehow.
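To make the timing claim in the second point concrete, here is a minimal sketch of a loader-only pass that can be timed and compared against a full training epoch. The dataset below is a random placeholder for my real one; batch_size=25 matches my setting:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random placeholder dataset standing in for the real one.
dataset = TensorDataset(torch.randn(1024, 3, 224, 224),
                        torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=25, shuffle=True, num_workers=4)

# Iterate the loader alone, so the elapsed time is pure data loading.
start = time.perf_counter()
for inputs, targets in loader:
    pass
print(f"loader-only pass: {time.perf_counter() - start:.2f}s")
```

In my case the loader-only pass takes a small fraction of the full epoch time, which is why I rule out data loading.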
Do you have any other suggestions for using the GPUs more effectively when the batch size is small?
I am training a model (ResNet-18) with a batch size of 64. One sample is 772K. I see very low GPU utilization: over a 10-second interval, utilization on all GPUs is 0% about 90% of the time, with memory usage around 10%. Sporadically, some GPUs stay at 0% utilization for the entire 10-second window. Is the loading process the bottleneck in my case? How can I use all the GPUs more efficiently?
I have attached the nvidia-smi output:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:04:00.0 Off |                    0 |
| N/A   41C    P0    66W / 149W |   1097MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:05:00.0 Off |                    0 |
| N/A   31C    P0    84W / 149W |   1697MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000000:84:00.0 Off |                    0 |
| N/A   43C    P0    59W / 149W |   1677MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000000:85:00.0 Off |                    0 |
| N/A   30C    P0    73W / 149W |   1697MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 00000000:8A:00.0 Off |                    0 |
| N/A   33C    P0    66W / 149W |   1711MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 00000000:8B:00.0 Off |                    0 |
| N/A   46C    P0    82W / 149W |   1735MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 00000000:8E:00.0 Off |                    0 |
| N/A   31C    P0    67W / 149W |   1717MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 00000000:8F:00.0 Off |                    0 |
| N/A   49C    P0    81W / 149W |    773MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      8369      C   python3                          1092MiB |
|    1   N/A  N/A      8369      C   python3                          1692MiB |
|    2   N/A  N/A      8369      C   python3                          1672MiB |
|    3   N/A  N/A      8369      C   python3                          1692MiB |
|    4   N/A  N/A      8369      C   python3                          1706MiB |
|    5   N/A  N/A      8369      C   python3                          1730MiB |
|    6   N/A  N/A      8369      C   python3                          1712MiB |
|    7   N/A  N/A      8369      C   python3                           768MiB |
+-----------------------------------------------------------------------------+
```
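To pin down whether loading is the bottleneck, here is a minimal, self-contained sketch of the measurement I have in mind: splitting each iteration into the time spent waiting on the DataLoader and the time spent in transfer plus compute. The tiny model and random dataset are placeholders for my ResNet-18 pipeline:

```python
import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data so the sketch runs; swap in the real
# ResNet-18 and dataset.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10)).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
dataset = TensorDataset(torch.randn(512, 3, 64, 64),
                        torch.randint(0, 10, (512,)))
loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)

data_time, gpu_time = 0.0, 0.0
end = time.perf_counter()
for inputs, targets in loader:
    fetched = time.perf_counter()
    data_time += fetched - end                 # time blocked waiting on the loader

    inputs = inputs.cuda(non_blocking=True)
    targets = targets.cuda(non_blocking=True)

    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()

    torch.cuda.synchronize()                   # let the GPU finish before timing
    gpu_time += time.perf_counter() - fetched
    end = time.perf_counter()

print(f"waiting on data: {data_time:.1f}s, transfer+compute: {gpu_time:.1f}s")
```

If data_time dominates, the loader is the bottleneck; if gpu_time dominates while nvidia-smi still shows 0% utilization, the time is likely going into host-to-device transfers or synchronization rather than compute kernels.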
I can load 256 torch samples (with interpolation applied) in ~0.26 seconds; without interpolation, it takes ~0.17 seconds. I used the same parameters (batch_size=64 and num_workers=4) in the DataLoader.
In any case, I observed that increasing the number of workers does not improve the loading time.
The only transformation I apply is an interpolation that resizes the input tensors. From the timings above, interpolation costs about 0.09 seconds per pool of 256 samples.
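Given that interpolation accounts for ~0.09 s per 256 samples, one option I am considering is moving the resize off the CPU loading path and doing it on the GPU with torch.nn.functional.interpolate. This is only a sketch, and the input and target shapes here are made up:

```python
import torch
import torch.nn.functional as F

# Sketch: resize on the GPU instead of in the CPU-side transform.
# The input shape (64, 3, 100, 100) and target size (224, 224) are made up.
batch = torch.randn(64, 3, 100, 100)           # uninterpolated batch from the loader
batch = batch.cuda(non_blocking=True)          # async copy if the tensor is pinned
batch = F.interpolate(batch, size=(224, 224),
                      mode='bilinear', align_corners=False)
```

This would batch the resize with the rest of the GPU work instead of paying for it sample by sample in the loader workers.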