Hi,
I have combed through the similar topics on the forums and unfortunately still cannot find a solution to my problem. Here is the machine I am working on:
(venv_a100) [sergey_mkrtchyan@a100-demo cformers]$ nvidia-smi
Thu Feb 18 17:10:56 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      On   | 00000000:14:00.0 Off |                   On |
| N/A   32C    P0    35W / 250W |     13MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-PCIE-40GB      On   | 00000000:15:00.0 Off |                   On |
| N/A   31C    P0    33W / 250W |     13MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-PCIE-40GB      On   | 00000000:39:00.0 Off |                   On |
| N/A   32C    P0    35W / 250W |     13MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-PCIE-40GB      On   | 00000000:3A:00.0 Off |                   On |
| N/A   29C    P0    29W / 250W |     13MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-PCIE-40GB      On   | 00000000:88:00.0 Off |                   On |
| N/A   31C    P0    32W / 250W |     13MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   5  A100-PCIE-40GB      On   | 00000000:89:00.0 Off |                   On |
| N/A   74C    P0   216W / 250W |  25776MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   6  A100-PCIE-40GB      On   | 00000000:B1:00.0 Off |                   On |
| N/A   31C    P0    32W / 250W |     13MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   7  A100-PCIE-40GB      On   | 00000000:B2:00.0 Off |                   On |
| N/A   31C    P0    31W / 250W |     13MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
You can see that GPU #5 is running a job.
Now, let's restrict CUDA to the first three GPUs and check how many devices PyTorch sees:
(venv_a100) [sergey_mkrtchyan@a100-demo cformers]$ export CUDA_VISIBLE_DEVICES=0,1,2
(venv_a100) [sergey_mkrtchyan@a100-demo cformers]$ echo $CUDA_VISIBLE_DEVICES
0,1,2
(venv_a100) [sergey_mkrtchyan@a100-demo cformers]$ python
Python 3.8.1 (default, Jan 21 2020, 07:30:43)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import os
>>> os.getenv('CUDA_VISIBLE_DEVICES')
'0,1,2'
>>> torch.cuda.device_count()
1
>>>
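I am running Python 3.8.1 and Torch 1.7.1+cu110.
With CUDA_VISIBLE_DEVICES=0,1,2 I expected torch.cuda.device_count() to return 3, but it returns 1, and I cannot figure out why.
Does anybody have any idea?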
Thank you.
Edit: On further investigation, the issue appears to be related to NVIDIA's new MIG (Multi-Instance GPU) technology, which nvidia-smi reports as Enabled on all eight GPUs above. It looks like with MIG enabled, CUDA will enumerate only one device. I wonder if anyone has experience with that.
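
In case it helps anyone who lands here, below is a rough sketch of how one might list the MIG instances from Python, so that a specific one can be selected explicitly. It shells out to `nvidia-smi -L` and pulls out the identifiers that start with `MIG-`. The helper names (`parse_mig_uuids`, `list_mig_uuids`) are my own, and the regular expression assumes the identifiers appear as `(UUID: MIG-...)` in the listing; I have not verified that layout on every driver version, so treat it as an assumption.

```python
import re
import subprocess
from typing import List

# Assumption: `nvidia-smi -L` prints MIG instances as "(UUID: MIG-...)".
# The pattern only captures identifiers that begin with "MIG-".
_MIG_UUID = re.compile(r"UUID:\s*(MIG-[^\s)]+)")


def parse_mig_uuids(listing: str) -> List[str]:
    """Return every MIG identifier found in the text of a device listing."""
    return _MIG_UUID.findall(listing)


def list_mig_uuids() -> List[str]:
    """Run `nvidia-smi -L` and return the MIG identifiers it reports."""
    result = subprocess.run(
        ["nvidia-smi", "-L"],
        capture_output=True,
        text=True,
        check=True,  # raise if nvidia-smi exits with an error
    )
    return parse_mig_uuids(result.stdout)
```

Exporting one of the returned identifiers as CUDA_VISIBLE_DEVICES before starting Python should then select that instance. As far as I understand, a single CUDA process can only enumerate one MIG instance at a time, which would be consistent with torch.cuda.device_count() returning 1 above.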