I’m running an AWS EC2 instance with 8x A100-40GB GPUs, and it’s mostly working fine, but I just noticed that GPU 0, while still holding its memory, is stuck at 0% utilization. It also appears to be running a different process than the other GPUs (see the process list at the bottom of the nvidia-smi output):
$ watch -n1 nvidia-smi
Every 1.0s: nvidia-smi ip-172-31-12-173: Sat Feb 3 20:45:38 2024
Sat Feb 3 20:45:39 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:10:1C.0 Off | 0 |
| N/A 25C P0 73W / 400W | 37590MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-40GB On | 00000000:10:1D.0 Off | 0 |
| N/A 25C P0 97W / 400W | 37342MiB / 40960MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-40GB On | 00000000:20:1C.0 Off | 0 |
| N/A 27C P0 98W / 400W | 37282MiB / 40960MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-40GB On | 00000000:20:1D.0 Off | 0 |
| N/A 25C P0 98W / 400W | 37342MiB / 40960MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM4-40GB On | 00000000:90:1C.0 Off | 0 |
| N/A 27C P0 103W / 400W | 37342MiB / 40960MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM4-40GB On | 00000000:90:1D.0 Off | 0 |
| N/A 25C P0 98W / 400W | 37342MiB / 40960MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM4-40GB On | 00000000:A0:1C.0 Off | 0 |
| N/A 29C P0 103W / 400W | 37282MiB / 40960MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM4-40GB On | 00000000:A0:1D.0 Off | 0 |
| N/A 22C P0 76W / 400W | 37138MiB / 40960MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 21094 C python 37582MiB |
| 1 N/A N/A 22119 C ...iniconda3/envs/torch_sat/bin/python 37334MiB |
| 2 N/A N/A 22120 C ...iniconda3/envs/torch_sat/bin/python 37274MiB |
| 3 N/A N/A 22121 C ...iniconda3/envs/torch_sat/bin/python 37334MiB |
| 4 N/A N/A 22122 C ...iniconda3/envs/torch_sat/bin/python 37334MiB |
| 5 N/A N/A 22123 C ...iniconda3/envs/torch_sat/bin/python 37334MiB |
| 6 N/A N/A 22124 C ...iniconda3/envs/torch_sat/bin/python 37274MiB |
| 7 N/A N/A 22125 C ...iniconda3/envs/torch_sat/bin/python 37130MiB |
+---------------------------------------------------------------------------------------+
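One thing I figure I can check is whether that process on GPU 0 is actually still being scheduled by the kernel. Here's a quick sketch that reads `/proc/<pid>/stat` (Linux-only; `process_state` is just a helper I wrote, and 21094 is the PID from the table above):

```python
import os

def process_state(pid: int) -> str:
    """Return the kernel scheduling state of a PID, or 'gone' if it has exited.

    Reads /proc/<pid>/stat; the field after the '(comm)' entry is the state
    letter: R=running, S=sleeping, D=uninterruptible I/O wait, Z=zombie.
    Splitting on the last ')' avoids mis-parsing process names that contain
    spaces or parentheses.
    """
    try:
        with open(f"/proc/{pid}/stat") as f:
            return f.read().rsplit(")", 1)[-1].split()[0]
    except FileNotFoundError:
        return "gone"

# PID of the process occupying GPU 0 in the nvidia-smi output above
print(process_state(21094))
```

If it reports "D" (stuck in uninterruptible I/O) or "Z" (zombie), that would at least explain the 0% utilization while the memory stays allocated.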
Why might this be happening?