torch.cuda.device_count() always returns 1

Hi,

(I have combed through the similar topics on the forums and unfortunately still cannot find a solution for my problem)

(venv_a100) [sergey_mkrtchyan@a100-demo cformers]$ nvidia-smi
Thu Feb 18 17:10:56 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      On   | 00000000:14:00.0 Off |                   On |
| N/A   32C    P0    35W / 250W |     13MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-PCIE-40GB      On   | 00000000:15:00.0 Off |                   On |
| N/A   31C    P0    33W / 250W |     13MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-PCIE-40GB      On   | 00000000:39:00.0 Off |                   On |
| N/A   32C    P0    35W / 250W |     13MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-PCIE-40GB      On   | 00000000:3A:00.0 Off |                   On |
| N/A   29C    P0    29W / 250W |     13MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-PCIE-40GB      On   | 00000000:88:00.0 Off |                   On |
| N/A   31C    P0    32W / 250W |     13MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   5  A100-PCIE-40GB      On   | 00000000:89:00.0 Off |                   On |
| N/A   74C    P0   216W / 250W |  25776MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   6  A100-PCIE-40GB      On   | 00000000:B1:00.0 Off |                   On |
| N/A   31C    P0    32W / 250W |     13MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   7  A100-PCIE-40GB      On   | 00000000:B2:00.0 Off |                   On |
| N/A   31C    P0    31W / 250W |     13MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

You can see that GPU #5 is running a job.

Now, let’s do the following:

(venv_a100) [sergey_mkrtchyan@a100-demo cformers]$ export CUDA_VISIBLE_DEVICES=0,1,2
(venv_a100) [sergey_mkrtchyan@a100-demo cformers]$ echo $CUDA_VISIBLE_DEVICES
0,1,2
(venv_a100) [sergey_mkrtchyan@a100-demo cformers]$ python
Python 3.8.1 (default, Jan 21 2020, 07:30:43)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import os
>>> os.getenv('CUDA_VISIBLE_DEVICES')
'0,1,2'
>>> torch.cuda.device_count()
1
>>>

Running Python 3.8.1 and PyTorch 1.7.1+cu110.

I cannot figure out why.

Does anybody have any idea?

Thank you.

Edit: On further investigation, it looks like the issue may be related to NVIDIA’s new MIG (Multi-Instance GPU) technology. It seems that with MIG enabled, CUDA enumerates only one device. I wonder if anyone has experience with that.

Yes, that is the expected behavior.
This post explains it a bit more.

Thank you @ptrblck that is very helpful.

I also dug into NVIDIA’s documentation on MIGs and am still not sure about the following,

Does that mean that if MIG is enabled on even a single GPU in the system then all the others essentially become useless?

Also from NVIDIA’s docs:

CUDA will not enumerate non-MIG GPU if any compute instance is enumerated on any other GPU

So, that seems to me to imply that all non-MIG GPUs become essentially inaccessible, is that right?

The second point is that it should still be possible to access different GPU instances (across different GPUs) by explicitly setting CUDA_VISIBLE_DEVICES, like

CUDA_VISIBLE_DEVICES=MIG-GPU-15b2013a-3a22-6ae2-eae9-967e9bda9007/7/0

It’s just that you cannot set several at a time. But you can use one per process.
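As a concrete sketch of that per-process pattern (using the MIG UUID quoted above; substitute one from your own `nvidia-smi -L` output), the variable has to be set before CUDA is first initialized in the process:

```python
import os

# Pin this process to a single MIG compute instance by its UUID.
# Must be set before the first `import torch` in this process;
# the UUID is the one quoted above -- replace it with your own.
os.environ["CUDA_VISIBLE_DEVICES"] = (
    "MIG-GPU-15b2013a-3a22-6ae2-eae9-967e9bda9007/7/0"
)

# With PyTorch installed on a MIG-enabled node, this process would
# now see exactly one device:
#   import torch
#   torch.cuda.device_count()  # -> 1
```

Each process gets its own value of the variable, so several processes can each take a different instance.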

Is my understanding right?

Thank you for your help :slight_smile:

That’s also my understanding. You are able to use a MIG instance in a separate process (otherwise you would basically remove all other MIG instances from the node), but other compute instances won’t be visible in this process.

Ok, I played with it a bit more and can confirm both points.

If any GPUs are in MIG mode in the system, CUDA does not see the non-MIG GPUs.

(venv_a100) [sergey_mkrtchyan@a100-demo cformers]$ nvidia-smi -L
*[Omitting some GPUs for brevity]*
GPU 4: A100-PCIE-40GB (UUID: GPU-4dc4cf75-1fd9-6d80-74ce-c26a52cecd47)
MIG 1g.5gb Device 0: (UUID: MIG-GPU-4dc4cf75-1fd9-6d80-74ce-c26a52cecd47/7/0)
MIG 1g.5gb Device 1: (UUID: MIG-GPU-4dc4cf75-1fd9-6d80-74ce-c26a52cecd47/8/0)
MIG 1g.5gb Device 2: (UUID: MIG-GPU-4dc4cf75-1fd9-6d80-74ce-c26a52cecd47/9/0)
MIG 1g.5gb Device 3: (UUID: MIG-GPU-4dc4cf75-1fd9-6d80-74ce-c26a52cecd47/10/0)
MIG 1g.5gb Device 4: (UUID: MIG-GPU-4dc4cf75-1fd9-6d80-74ce-c26a52cecd47/11/0)
MIG 1g.5gb Device 5: (UUID: MIG-GPU-4dc4cf75-1fd9-6d80-74ce-c26a52cecd47/12/0)
MIG 1g.5gb Device 6: (UUID: MIG-GPU-4dc4cf75-1fd9-6d80-74ce-c26a52cecd47/13/0)
GPU 5: A100-PCIE-40GB (UUID: GPU-909ef3d1-5b61-d36a-0f03-ae7b85464540)
MIG 7g.40gb Device 0: (UUID: MIG-GPU-909ef3d1-5b61-d36a-0f03-ae7b85464540/0/0)
GPU 6: A100-PCIE-40GB (UUID: GPU-67260e8f-5838-5b68-c651-1077b44f562f)
GPU 7: A100-PCIE-40GB (UUID: GPU-3761d9cd-7737-fa3e-a8c6-00265e242f04)
(venv_a100) [sergey_mkrtchyan@a100-demo cformers]$
(venv_a100) [sergey_mkrtchyan@a100-demo cformers]$ export CUDA_VISIBLE_DEVICES='6,7'
(venv_a100) [sergey_mkrtchyan@a100-demo cformers]$
(venv_a100) [sergey_mkrtchyan@a100-demo cformers]$ python
Python 3.8.1 (default, Jan 21 2020, 07:30:43)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import os
>>> os.getenv('CUDA_VISIBLE_DEVICES')
'6,7'
>>> torch.cuda.device_count()
1
>>> torch.cuda.current_device()
/mnt/nlu/users/sergey_mkrtchyan/workspace/cformers/venv_a100/lib/python3.8/site-packages/torch/cuda/__init__.py:104: UserWarning:
A100-PCIE-40GB MIG 1g.5gb with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75.
If you want to use the A100-PCIE-40GB MIG 1g.5gb GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
0
>>>

Notice how it grabs the 1g.5gb MIG instance. (The sm_80 warning is a separate issue: that PyTorch build was not compiled with support for compute capability 8.0.)

In MIG mode you are free to grab any instance from a separate process, but you cannot grab two at a time within the same process.
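A minimal launcher sketch for that one-instance-per-process workflow (the MIG UUIDs are the ones from the `nvidia-smi -L` output above; a real worker would `import torch` instead of just echoing the variable):

```python
import os
import subprocess
import sys

def launch_on_mig(mig_uuid, cmd):
    """Start one worker process pinned to a single MIG instance via its UUID."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=mig_uuid)
    return subprocess.Popen(cmd, env=env)

# UUIDs as reported by `nvidia-smi -L` above -- substitute your own.
mig_uuids = [
    "MIG-GPU-4dc4cf75-1fd9-6d80-74ce-c26a52cecd47/7/0",
    "MIG-GPU-4dc4cf75-1fd9-6d80-74ce-c26a52cecd47/8/0",
]

# Each worker sees exactly one CUDA device; a real worker would
# `import torch` here and find torch.cuda.device_count() == 1.
cmd = [sys.executable, "-c",
       "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
procs = [launch_on_mig(u, cmd) for u in mig_uuids]
for p in procs:
    p.wait()
```

This mirrors what the shell transcripts above do with `export CUDA_VISIBLE_DEVICES=...`, just one environment per child process.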

This makes MIG a little hard to swallow, I would say.

Thanks a lot for the help, @ptrblck !