I am training a large network such as a ResNet with a very small batch size, say 25. When I do that, I get very low and oscillating GPU utilization. I have seen several posts regarding low GPU utilization in PyTorch. However, they suggest one of the following:
“Increase the batch size.”: But the small batch size is a deliberate modeling choice, not a computational one; I want it to stay small.
“Increase the number of workers, as data loading might be the bottleneck.”: First, data loading is not the bottleneck, since it takes much less time than the computation (see the timing sketch after this list). Second, increasing the number of workers actually increases the running time of my code. Third, the low and oscillating GPU utilization persists even after increasing the number of workers. Hence, this suggestion does not apply either.
“Set shuffle=False.”: Again, not a feasible solution, as I have to shuffle my data somehow.
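To make the timing claim in the second point concrete, here is a minimal sketch of a loader-only pass that can be timed and compared against a full training epoch. The dataset below is a random placeholder for my real one; batch_size=25 matches my setting:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random placeholder dataset standing in for the real one.
dataset = TensorDataset(torch.randn(1024, 3, 224, 224),
                        torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=25, shuffle=True, num_workers=4)

# Iterate the loader alone, so the elapsed time is pure data loading.
start = time.perf_counter()
for inputs, targets in loader:
    pass
print(f"loader-only pass: {time.perf_counter() - start:.2f}s")
```

In my case the loader-only pass takes a small fraction of the full epoch time, which is why I rule out data loading.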
Do you have any other suggestions for using the GPUs more effectively when the batch size is small?
I am training a model (ResNet-18) with a batch size of 64. One sample is 772K. I see very low GPU utilization: over a 10-second interval, utilization on all GPUs is 0% about 90% of the time, with memory usage around 10%. Sporadically, some GPUs stay at 0% utilization for the entire 10-second window. Is the loading process the bottleneck in my case? How can I use all the GPUs more efficiently?
I have attached the nvidia-smi output:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:04:00.0 Off |                    0 |
| N/A   41C    P0    66W / 149W |   1097MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:05:00.0 Off |                    0 |
| N/A   31C    P0    84W / 149W |   1697MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000000:84:00.0 Off |                    0 |
| N/A   43C    P0    59W / 149W |   1677MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000000:85:00.0 Off |                    0 |
| N/A   30C    P0    73W / 149W |   1697MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 00000000:8A:00.0 Off |                    0 |
| N/A   33C    P0    66W / 149W |   1711MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 00000000:8B:00.0 Off |                    0 |
| N/A   46C    P0    82W / 149W |   1735MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 00000000:8E:00.0 Off |                    0 |
| N/A   31C    P0    67W / 149W |   1717MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 00000000:8F:00.0 Off |                    0 |
| N/A   49C    P0    81W / 149W |    773MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      8369      C   python3                          1092MiB |
|    1   N/A  N/A      8369      C   python3                          1692MiB |
|    2   N/A  N/A      8369      C   python3                          1672MiB |
|    3   N/A  N/A      8369      C   python3                          1692MiB |
|    4   N/A  N/A      8369      C   python3                          1706MiB |
|    5   N/A  N/A      8369      C   python3                          1730MiB |
|    6   N/A  N/A      8369      C   python3                          1712MiB |
|    7   N/A  N/A      8369      C   python3                           768MiB |
+-----------------------------------------------------------------------------+
```
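To pin down whether loading is the bottleneck, here is a minimal, self-contained sketch of the measurement I have in mind: splitting each iteration into the time spent waiting on the DataLoader and the time spent in transfer plus compute. The tiny model and random dataset are placeholders for my ResNet-18 pipeline:

```python
import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data so the sketch runs; swap in the real
# ResNet-18 and dataset.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10)).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
dataset = TensorDataset(torch.randn(512, 3, 64, 64),
                        torch.randint(0, 10, (512,)))
loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)

data_time, gpu_time = 0.0, 0.0
end = time.perf_counter()
for inputs, targets in loader:
    fetched = time.perf_counter()
    data_time += fetched - end                 # time blocked waiting on the loader

    inputs = inputs.cuda(non_blocking=True)
    targets = targets.cuda(non_blocking=True)

    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()

    torch.cuda.synchronize()                   # let the GPU finish before timing
    gpu_time += time.perf_counter() - fetched
    end = time.perf_counter()

print(f"waiting on data: {data_time:.1f}s, transfer+compute: {gpu_time:.1f}s")
```

If data_time dominates, the loader is the bottleneck; if gpu_time dominates while nvidia-smi still shows 0% utilization, the time is likely going into host-to-device transfers or synchronization rather than compute kernels.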
I can load 256 torch samples (with interpolation applied) in ~0.26 seconds; without interpolation, it takes ~0.17 seconds. I used the same parameters (batch_size=64 and num_workers=4) in the DataLoader.
In any case, I observed that increasing the number of workers does not improve the loading time.
The only transformation I apply is an interpolation that resizes the input tensors. From the timings above, interpolation costs about 0.09 seconds per pool of 256 samples.
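Given that interpolation accounts for ~0.09 s per 256 samples, one option I am considering is moving the resize off the CPU loading path and doing it on the GPU with torch.nn.functional.interpolate. This is only a sketch, and the input and target shapes here are made up:

```python
import torch
import torch.nn.functional as F

# Sketch: resize on the GPU instead of in the CPU-side transform.
# The input shape (64, 3, 100, 100) and target size (224, 224) are made up.
batch = torch.randn(64, 3, 100, 100)           # uninterpolated batch from the loader
batch = batch.cuda(non_blocking=True)          # async copy if the tensor is pinned
batch = F.interpolate(batch, size=(224, 224),
                      mode='bilinear', align_corners=False)
```

This would batch the resize with the rest of the GPU work instead of paying for it sample by sample in the loader workers.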