Low GPU utilization problem

I am training a large network such as ResNet with a very small batch size, say 25. When I do that, I get very low and oscillating GPU utilization. I have seen several posts about low GPU utilization in PyTorch, but they all suggest one of the following:

  1. “Increase the batch size.”: But the batch size is not a computational choice in my case, and I want it to stay small.

  2. “Increase the number of workers, as data loading might be the bottleneck.”: First, data loading is not the bottleneck, since it takes far less time than the GPU computation (see the timing sketch at the end of this post). Second, increasing the number of workers actually increases the running time of my code. Third, the low and oscillating GPU utilization persists even after increasing the number of workers. Hence, this suggestion does not apply either.

  3. “Set shuffle = False”: Again, not a feasible solution, as I need to shuffle my data.

Do you have any other suggestions for using the GPU more effectively when the batch size is small?
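
For reference, this is roughly how the per-iteration time can be split between data loading and GPU compute. It is a minimal sketch with a placeholder model and random data, not my actual training script; torch.cuda.synchronize() is needed because CUDA kernels run asynchronously, so the compute timing would otherwise be misleading:

import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder model and data, only to illustrate the timing pattern.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
dataset = TensorDataset(torch.randn(500, 3, 224, 224),
                        torch.randint(0, 10, (500,)))
loader = DataLoader(dataset, batch_size=25, shuffle=True, num_workers=4)

load_time = compute_time = 0.0
end = time.perf_counter()
for images, targets in loader:
    t_loaded = time.perf_counter()
    load_time += t_loaded - end              # time spent waiting for the DataLoader

    images, targets = images.to(device), targets.to(device)
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
    if device.type == "cuda":
        torch.cuda.synchronize()             # let queued GPU work finish before stopping the clock
    end = time.perf_counter()
    compute_time += end - t_loaded

print(f"data loading: {load_time:.2f}s, forward/backward: {compute_time:.2f}s")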

Hi @kko,
Did you manage to resolve this issue?

I am training a model (ResNet-18) with a batch size of 64; one sample is 772K. I see very low GPU utilization: over a 10-second interval, all GPUs are at 0% utilization about 90% of the time, with memory usage around 10%. Sporadically, some GPUs stay at 0% utilization for the entire 10 seconds. Is the loading process the bottleneck in my case? How can I use all the GPUs more efficiently?

I have attached the nvidia-smi output.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:04:00.0 Off |                    0 |
| N/A   41C    P0    66W / 149W |   1097MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:05:00.0 Off |                    0 |
| N/A   31C    P0    84W / 149W |   1697MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000000:84:00.0 Off |                    0 |
| N/A   43C    P0    59W / 149W |   1677MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000000:85:00.0 Off |                    0 |
| N/A   30C    P0    73W / 149W |   1697MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 00000000:8A:00.0 Off |                    0 |
| N/A   33C    P0    66W / 149W |   1711MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 00000000:8B:00.0 Off |                    0 |
| N/A   46C    P0    82W / 149W |   1735MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 00000000:8E:00.0 Off |                    0 |
| N/A   31C    P0    67W / 149W |   1717MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 00000000:8F:00.0 Off |                    0 |
| N/A   49C    P0    81W / 149W |    773MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      8369      C   python3                          1092MiB |
|    1   N/A  N/A      8369      C   python3                          1692MiB |
|    2   N/A  N/A      8369      C   python3                          1672MiB |
|    3   N/A  N/A      8369      C   python3                          1692MiB |
|    4   N/A  N/A      8369      C   python3                          1706MiB |
|    5   N/A  N/A      8369      C   python3                          1730MiB |
|    6   N/A  N/A      8369      C   python3                          1712MiB |
|    7   N/A  N/A      8369      C   python3                           768MiB |
+-----------------------------------------------------------------------------+
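
For reference, per-GPU utilization over a 10-second window like the one described above can also be sampled from Python. A minimal sketch, assuming the pynvml package is installed (the table above itself comes from plain nvidia-smi):

import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

# Sample once per second for 10 seconds; .gpu is the compute utilization in percent.
for _ in range(10):
    rates = [pynvml.nvmlDeviceGetUtilizationRates(h) for h in handles]
    print(" | ".join(f"GPU{i}: {r.gpu:3d}%" for i, r in enumerate(rates)))
    time.sleep(1)

pynvml.nvmlShutdown()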

Have you checked whether data loading is the bottleneck? In my experience, that is usually the reason for low utilization.

  1. How many workers are you using for data loading?
  2. How are the images stored? How much time, on average, does it take to load an image from disk?
  3. Are the transforms very time-consuming? (A quick way to measure points 2 and 3 is sketched below.)
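
For example, a rough way to time a single sample end to end (the path and the transform below are placeholders for whatever your pipeline actually uses):

import time
from PIL import Image
from torchvision import transforms

# Placeholder transform; replace with the transforms you actually apply.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

t0 = time.perf_counter()
img = Image.open("path/to/sample.jpg").convert("RGB")   # placeholder path
t1 = time.perf_counter()
tensor = transform(img)
t2 = time.perf_counter()

print(f"disk load: {(t1 - t0) * 1e3:.1f} ms, transform: {(t2 - t1) * 1e3:.1f} ms")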

@user_123454321 Thank you for your quick response.

  1. I am using 4 workers as follows:
import torch
from torch.utils.data import SubsetRandomSampler

# dataset, train_idx and init_worker are defined elsewhere in my code
loader = torch.utils.data.DataLoader(dataset,
                                     batch_size=64,
                                     sampler=SubsetRandomSampler(train_idx),
                                     num_workers=4,
                                     worker_init_fn=init_worker)
  2. Prior to the training phase, I saved a tensor version of each sample to disk with:
torch.save(transforms.ToTensor()(img), path / 'data.pt')

I load 256 saved tensor samples (with interpolation applied) in ~0.26 seconds; without interpolation, it takes ~0.17 seconds. I used the same parameters (batch_size=64 and num_workers=4) in the DataLoader.
In any case, I observed that increasing the number of workers does not improve the loading time.

  3. As a transform, I apply only an interpolation to resize the tensor input. From point 2, the interpolation costs about 0.09 seconds for a pool of 256 samples (a sketch of this loading/resizing path follows below).
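
For context, a minimal sketch of this kind of loading path (the folder layout and the use of F.interpolate for the resize are assumptions for illustration; labels are omitted):

from pathlib import Path

import torch
import torch.nn.functional as F
from torch.utils.data import Dataset

class SavedTensorDataset(Dataset):
    """Loads samples that were pre-saved to disk as 'data.pt' tensors."""

    def __init__(self, root, size=(224, 224)):
        # Assumed layout: one sub-folder per sample, each containing data.pt
        self.paths = sorted(Path(root).glob("*/data.pt"))
        self.size = size

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = torch.load(self.paths[idx])        # CHW tensor saved with torch.save
        # F.interpolate expects a batch dimension, hence unsqueeze/squeeze.
        img = F.interpolate(img.unsqueeze(0), size=self.size,
                            mode="bilinear", align_corners=False).squeeze(0)
        return img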