CUDA driver initialization failed, you might not have a CUDA gpu

Andreas_Reich · November 6, 2023, 10:35am

System:

I have a 1-year old Linux Machine with 1x A100 40GB, plenty of ram and a server-grade CPU. I’m using Ubuntu 20.4…

Main Use:

I use the machine manly to run LLM models with torch (textgeneration-webui and loading LLMS directly, but also other vision algorithms…)

Previously

So, torch and The GPU were working fine and stable under Conda 11.7 with the Correct NVIDIA driver:

Problem & Solution Attempts

→ Then, I upgraded to the 12.2 Driver which is not compatible with the latest pytorch (12.1) yet as it seems (nighlty build didn’t work as well).
→ Then I downgraded to the 11.7 driver and the respective torch, but that stopped working then.

→ Now, I’ve tried every Toolkit on the shelf from cuda-11 til 12.3 and all Available Drivers for my A100:

NVIDIA-Linux-x86_64-450.51.05.run
NVIDIA-Linux-x86_64-515.86.01.run
NVIDIA-Linux-x86_64-525.125.06.run
NVIDIA-Linux-x86_64-535.104.12.run

for all of these toolkits and updated the .bashrc file accordingly.
I also restarted the system every time.

System is fully up-to date.

nvidia-smi would always find the graphics card under any driver:

nvidia-smi
Mon Nov  6 10:01:15 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:B1:00.0 Off |                   On |
| N/A   32C    P0    36W / 250W |      0MiB / 40960MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  No MIG devices found                                                       |
+-----------------------------------------------------------------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Still, whatever I do, the system will tell me:

>>> import torch
>>> print(torch.cuda.is_available())
/home/ubuntu/anaconda3/envs/sentiment/lib/python3.9/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /opt/conda/conda-bld/pytorch_1695392035629/work/c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
>>> print(torch.cuda.get_device_name(0))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/anaconda3/envs/sentiment/lib/python3.9/site-packages/torch/cuda/__init__.py", line 419, in get_device_name
    return get_device_properties(device).name
  File "/home/ubuntu/anaconda3/envs/sentiment/lib/python3.9/site-packages/torch/cuda/__init__.py", line 449, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/home/ubuntu/anaconda3/envs/sentiment/lib/python3.9/site-packages/torch/cuda/__init__.py", line 298, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.

And when I try to run basically any of my scripts in any of my environments with basically any driver active:

(sentiment) my/machine/and/path$ python myscript.py 
/home/ubuntu/anaconda3/envs/sentiment/lib/python3.9/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /opt/conda/conda-bld/pytorch_1695392035629/work/c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0

I’ve also tried this:

I’ve created plenty of fresh environments to check if anything else is interfearing - its just torch that can’t talk to the GPU
I’ve also tried using environments with different python versions like 3.8/3.9/3.10/3.11 unsuccessfully.
I’ve installed plenty of legacy versions, nightly builds and so on.
Also tried to use pip / conda installs.

I’ve been on this for about two weeks by now and have read plenty of other posts that seem similar and have tried many things that helped others but it just won’t work…

Here it was stated that PyTorch seems to be independent of my machines toolkit anyway:

ptrblck about CUDA Toolkit

But that the Driver is relevant:

ptrblck - About Driver & system reboot

If you need more specs or details I’m more than happy to provide anything needed.

Any help/tip is greatly appreciated.

ptrblck · November 6, 2023, 3:04pm

All drivers are compatible within a major version due to the driver’s minor version compatibility support.

Disable MIG and it might work.

Andreas_Reich · November 7, 2023, 1:01pm

Thank you, I couldn’t figure out when the MIG settings were activated, I think I deactivated them previously when I tried to get it to work, but I think I didn’t reboot the system then.

So to anyone with the same issue make sure to reboot after deactivating MIG!

Thanks very much for helping out!

Andreas_Reich · November 7, 2023, 1:02pm

Ah and here’s the command:

sudo nvidia-smi -mig 0

worked for me → Check your nvidia-smi if MIG is active (as shown above). You might need to reboot before you are able to disable mig. Also make sure to sudo systemctl stop display-manager