System:
I have a one-year-old Linux machine with 1x A100 40GB, plenty of RAM, and a server-grade CPU. I’m using Ubuntu 20.04…
Main Use:
I use the machine mainly to run LLMs with torch (text-generation-webui and loading LLMs directly, but also other vision algorithms…)
Previously
So, torch and the GPU were working fine and stable under CUDA 11.7 with the matching NVIDIA driver:
|Version:|515.105.01|
|Operating System:|Linux 64-bit|
|CUDA Toolkit:|11.7|
Problem & Solution Attempts
→ Then I upgraded to the 12.2 driver, which doesn’t seem to be compatible with the latest PyTorch (CUDA 12.1) yet (the nightly build didn’t work either).
→ Then I downgraded to the 11.7 driver and the matching torch build, but at that point it stopped working.
→ Now I’ve tried every toolkit on the shelf from CUDA 11 to 12.3, and all available drivers for my A100:
NVIDIA-Linux-x86_64-450.51.05.run
NVIDIA-Linux-x86_64-515.86.01.run
NVIDIA-Linux-x86_64-525.125.06.run
NVIDIA-Linux-x86_64-535.104.12.run
for each of these toolkits, and updated the .bashrc file accordingly.
I also restarted the system every time.
System is fully up-to date.
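For completeness, the .bashrc entries I update per toolkit look like this (paths assume the default /usr/local install location; the version number is swapped for whichever toolkit is active):

```shell
# Point the shell at the currently installed CUDA toolkit
# (11.7 here; changed to match whichever toolkit I'm testing)
export PATH=/usr/local/cuda-11.7/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.7/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```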
nvidia-smi would always find the graphics card under any driver:
nvidia-smi
Mon Nov 6 10:01:15 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... Off | 00000000:B1:00.0 Off | On |
| N/A 32C P0 36W / 250W | 0MiB / 40960MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| No MIG devices found |
+-----------------------------------------------------------------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
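One thing I notice in that output: the MIG column shows Enabled, yet no MIG devices are listed. These are the commands I use to inspect that state (flags per the nvidia-smi documentation; guarded so the snippet is safe to paste on any box):

```shell
# Check MIG mode and list any MIG instances, if nvidia-smi is present
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L                                         # list GPUs and any MIG instances
    nvidia-smi --query-gpu=mig.mode.current --format=csv  # current MIG mode per GPU
else
    echo "nvidia-smi not found"
fi
```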
Still, whatever I do, the system will tell me:
>>> import torch
>>> print(torch.cuda.is_available())
/home/ubuntu/anaconda3/envs/sentiment/lib/python3.9/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /opt/conda/conda-bld/pytorch_1695392035629/work/c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
>>> print(torch.cuda.get_device_name(0))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/anaconda3/envs/sentiment/lib/python3.9/site-packages/torch/cuda/__init__.py", line 419, in get_device_name
return get_device_properties(device).name
File "/home/ubuntu/anaconda3/envs/sentiment/lib/python3.9/site-packages/torch/cuda/__init__.py", line 449, in get_device_properties
_lazy_init() # will define _get_device_properties
File "/home/ubuntu/anaconda3/envs/sentiment/lib/python3.9/site-packages/torch/cuda/__init__.py", line 298, in _lazy_init
torch._C._cuda_init()
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.
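For reference, the quick check I run in every fresh environment looks like this (the helper name is my own, purely for illustration):

```python
def cuda_diagnostics():
    """Return a short report of what torch can see; safe to run anywhere."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    lines = [
        f"torch {torch.__version__}",
        f"built against CUDA {torch.version.cuda}",      # toolkit the wheel was built with
        f"cuda available: {torch.cuda.is_available()}",  # False in my case
        f"device count: {torch.cuda.device_count()}",
    ]
    return "\n".join(lines)

print(cuda_diagnostics())
```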
And the same happens when I try to run basically any of my scripts in any of my environments, with any driver active:
(sentiment) my/machine/and/path$ python myscript.py
/home/ubuntu/anaconda3/envs/sentiment/lib/python3.9/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /opt/conda/conda-bld/pytorch_1695392035629/work/c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
I’ve also tried the following:
- I’ve created plenty of fresh environments to check whether anything else is interfering; it’s just torch that can’t talk to the GPU.
- I’ve tried environments with different Python versions (3.8/3.9/3.10/3.11), without success.
- I’ve installed plenty of legacy versions, nightly builds, and so on.
- I’ve tried both pip and conda installs.
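For reference, the pip/conda install variants I tried looked roughly like this (index URL and channel names as given by the pytorch.org install selector; exact versions varied):

```shell
# pip, CUDA 11.8 wheels
pip install torch --index-url https://download.pytorch.org/whl/cu118

# conda, from the pytorch and nvidia channels
conda install pytorch pytorch-cuda=11.8 -c pytorch -c nvidia
```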
I’ve been at this for about two weeks now. I’ve read plenty of other posts that seem similar and tried many of the things that helped others, but it just won’t work…
Here it was stated that PyTorch’s binaries are independent of my machine’s local toolkit anyway:
But that the driver is relevant:
ptrblck - About Driver & system reboot
If you need more specs or details, I’m more than happy to provide anything needed.
Any help/tip is greatly appreciated.