GPU 0 (of 8) has memory but is idle

I’m running an 8x A100-40GB instance on AWS EC2 and it’s basically working fine, but I just noticed that GPU 0, while still holding its memory allocation, is stuck at 0% utilization.
It also appears to be running a different process from the other GPUs (note the PID and process name in the process list at the bottom of the nvidia-smi output):

$ watch -n1 nvidia-smi

Every 1.0s: nvidia-smi                                                                                                                  ip-172-31-12-173: Sat Feb  3 20:45:38 2024

Sat Feb  3 20:45:39 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:10:1C.0 Off |                    0 |
| N/A   25C    P0              73W / 400W |  37590MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:10:1D.0 Off |                    0 |
| N/A   25C    P0              97W / 400W |  37342MiB / 40960MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  | 00000000:20:1C.0 Off |                    0 |
| N/A   27C    P0              98W / 400W |  37282MiB / 40960MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  | 00000000:20:1D.0 Off |                    0 |
| N/A   25C    P0              98W / 400W |  37342MiB / 40960MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-40GB          On  | 00000000:90:1C.0 Off |                    0 |
| N/A   27C    P0             103W / 400W |  37342MiB / 40960MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-40GB          On  | 00000000:90:1D.0 Off |                    0 |
| N/A   25C    P0              98W / 400W |  37342MiB / 40960MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-40GB          On  | 00000000:A0:1C.0 Off |                    0 |
| N/A   29C    P0             103W / 400W |  37282MiB / 40960MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-40GB          On  | 00000000:A0:1D.0 Off |                    0 |
| N/A   22C    P0              76W / 400W |  37138MiB / 40960MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     21094      C   python                                    37582MiB |
|    1   N/A  N/A     22119      C   ...iniconda3/envs/torch_sat/bin/python    37334MiB |
|    2   N/A  N/A     22120      C   ...iniconda3/envs/torch_sat/bin/python    37274MiB |
|    3   N/A  N/A     22121      C   ...iniconda3/envs/torch_sat/bin/python    37334MiB |
|    4   N/A  N/A     22122      C   ...iniconda3/envs/torch_sat/bin/python    37334MiB |
|    5   N/A  N/A     22123      C   ...iniconda3/envs/torch_sat/bin/python    37334MiB |
|    6   N/A  N/A     22124      C   ...iniconda3/envs/torch_sat/bin/python    37274MiB |
|    7   N/A  N/A     22125      C   ...iniconda3/envs/torch_sat/bin/python    37130MiB |
+---------------------------------------------------------------------------------------+

Why might this be happening?

Just FYI, cat /proc/21094/cmdline | xargs -0 echo does show the train.py invocation I expect GPU 0 to be running (the other PIDs show the same command line).
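
Along the same lines, a quick way to cross-check the PID-to-GPU mapping and current utilization programmatically, rather than eyeballing the table above, is the NVML Python bindings. This is just a minimal sketch assuming the pynvml package (pip install nvidia-ml-py); nothing in it is specific to my setup:

# Map each GPU to the compute PIDs holding it and report current utilization.
# Assumes the pynvml bindings (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)    # .gpu is a percentage
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)            # bytes
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        pids = ", ".join(str(p.pid) for p in procs) or "none"
        print(f"GPU {i}: util={util.gpu:3d}%  "
              f"mem={mem.used / 2**20:.0f}/{mem.total / 2**20:.0f} MiB  "
              f"pids=[{pids}]")
finally:
    pynvml.nvmlShutdown()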

What kind of script are you running and does it execute as expected?

It’s a PyTorch Lightning script training an audio VAE, and nothing seems wrong with the execution itself. It does spend a lot of time pulling in data, though, and overall performance is pretty poor. I’m new to AWS EC2, so I still need to dig into the data drive details (it’s an attached EBS volume), but the GCP instance I was running last month was significantly faster (granted, that was the A100-SXM4-80GB variant, and 2 GPUs vs. 8).
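
In case it’s useful to anyone hitting the same thing: when input loading dominates, the first knobs I’d look at in plain PyTorch are the DataLoader worker and pinning settings. A minimal sketch below; the dataset, batch size, and worker count are placeholders, not values from my actual script:

# DataLoader settings that typically matter when input I/O is the bottleneck.
# The dataset instance and the numbers here are placeholders.
from torch.utils.data import DataLoader

train_loader = DataLoader(
    my_audio_dataset,          # hypothetical Dataset instance
    batch_size=32,             # placeholder
    shuffle=True,
    num_workers=8,             # parallel workers reading/decoding audio
    pin_memory=True,           # faster host-to-GPU copies
    persistent_workers=True,   # keep workers alive across epochs
    prefetch_factor=4,         # batches prefetched per worker
)

In Lightning the loader is just returned from train_dataloader() (or passed to Trainer.fit), so the same settings apply unchanged.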

UPDATE: I got a significant improvement (~40%) in overall performance by boosting the IOPS on the EBS volume. I also reduced the data (there was data on the drive that had already been pruned from training due to balance issues) and shrank the overall volume size. After those steps, GPU 0 also came to life (not sure whether that’s directly connected, but it’s notable).
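
For anyone in the same spot: before (or after) paying for provisioned IOPS, it’s worth sanity-checking raw read speed from the data directory. A rough sketch; the mount path is hypothetical, and the files read should not already be in the page cache for the number to mean anything:

# Rough estimate of read throughput from the training data directory.
# The path is hypothetical; read cold files for a realistic figure.
import time
from pathlib import Path

data_dir = Path("/mnt/ebs-data/audio")          # hypothetical mount point
files = [f for f in data_dir.rglob("*") if f.is_file()][:200]   # small sample

total_bytes, start = 0, time.monotonic()
for f in files:
    total_bytes += len(f.read_bytes())
elapsed = time.monotonic() - start

print(f"read {total_bytes / 2**20:.0f} MiB in {elapsed:.1f}s "
      f"({total_bytes / 2**20 / elapsed:.0f} MiB/s)")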
I also noticed, after posting, that in the initial case GPU 0 would very occasionally spike to >90% utilization. I expect some fluctuation, of course, but I’d never seen a GPU sit flat at 0% with only rare spikes like that before.
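
Since watch -n1 only samples once a second, it’s easy to miss (or misread) those spikes. A small polling loop makes the pattern clearer; again just a sketch with pynvml, and the interval/duration are arbitrary:

# Sample GPU 0 utilization at sub-second intervals to see whether it is truly
# idle or just spiking between nvidia-smi refreshes. Assumes pynvml.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
samples = []
try:
    for _ in range(600):          # ~5 minutes at 0.5 s intervals
        samples.append(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
        time.sleep(0.5)
finally:
    pynvml.nvmlShutdown()

busy = sum(1 for s in samples if s > 0)
print(f"max={max(samples)}%, nonzero samples={busy}/{len(samples)}")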