CUDA no longer available after installing the latest NVIDIA drivers

wilhelm · May 24, 2024, 12:50pm

Today I updated my system to the following state:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+

I’m running a conda environment on the following system:

PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye

I had installed pytorch-cuda 12.4. Since it didn’t work, I recreated the conda environment and tried everything proposed on the PyTorch website:

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch-nightly -c nvidia
That didn’t work either, so I deleted everything again and installed:
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch-nightly -c nvidia
Finally I tried building PyTorch myself. The build succeeded, but every time I try to run on CUDA I get:

Python 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
~/anaconda3/envs/environment/lib/python3.12/site-packages/torch/cuda/__init__.py:127: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/conda/conda-bld/pytorch_1716536554221/work/c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False

What should I do?
Do I need to downgrade the driver on the base system to one that matches CUDA 12.4?
By the way: nvcc, cuDNN, and the CUDA toolkit are installed and working, and the library paths are set as requested:

export PATH=/usr/local/cuda-12.5/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-12.5/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
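
The warning above mentions CUDA_VISIBLE_DEVICES, and the exports change PATH and LD_LIBRARY_PATH, so it can help to check what the Python process actually inherits. A minimal sketch using only the standard library; the helper name `cuda_env_report` is hypothetical and not part of PyTorch:

```python
import os


def cuda_env_report(env=None):
    """Collect the CUDA-related environment settings a process would see.

    `cuda_env_report` is a hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env

    def cuda_entries(var):
        # Keep only the path entries that mention "cuda" (case-insensitive).
        return [p for p in env.get(var, "").split(os.pathsep) if "cuda" in p.lower()]

    return {
        # None means the variable is unset, so no GPU is hidden by it.
        "CUDA_VISIBLE_DEVICES": env.get("CUDA_VISIBLE_DEVICES"),
        "PATH": cuda_entries("PATH"),
        "LD_LIBRARY_PATH": cuda_entries("LD_LIBRARY_PATH"),
    }


if __name__ == "__main__":
    for name, value in cuda_env_report().items():
        print(f"{name}: {value}")
```

An empty string for CUDA_VISIBLE_DEVICES hides every GPU, whereas an unset variable hides none, so the two cases are worth telling apart.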

ptrblck · May 24, 2024, 3:17pm

Did you run any CUDA application successfully after the driver update as it seems the update itself has broken your setup? In the past we’ve seen similar issues where users forgot to e.g. restart the machine after a driver update (as requested by the update script) and were running into all kinds of issues.

wilhelm · May 24, 2024, 3:53pm

Yes,
I tried, and the examples run perfectly. I even rebooted many times.
But this problem has now been “covered” by this one here.
No matter what I do, I cannot get rid of the installed version built on my system.

kovalexal · May 24, 2024, 8:26pm

Hi @wilhelm!

I also updated the drivers yesterday (555.42.03) and cannot get torch to work after that.

I have a Windows 11 laptop and was running the nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04 Docker container via WSL2, in which I installed torch via miniconda from the official torch repo with prebuilt binaries.

I tried installing torch 2.3.0+cu118 and torch 2.3.0+cu121 (of course I passed my GPU into the container), and in every combination torch cannot see the GPU.

I have also tried other containers with CUDA 12.4.1 and cannot get it to work.

Strangely, nvitop also stopped working correctly, but nvidia-smi works fine.

Have you found a solution to your problem?

UPD: I rolled my driver back to the previous version through Device Manager and everything started working as expected.

ptrblck · May 24, 2024, 9:16pm

I’ve answered in the corresponding thread.

wilhelm · May 25, 2024, 9:04am

No, I haven’t been able to fix the problem yet,
no matter which way I install PyTorch with CUDA support.
I also tried installing the CUDA toolkit 12.4, but it made no difference.

I fear we have to wait for a new release or an update.

@ptrblck which thread do you mean?

wilhelm · May 25, 2024, 10:22am

Here is a short update on the situation.

I removed all the NVIDIA drivers, toolkits, CUDA etc. from the whole system.
That means:

     $ sudo apt purge nvidia*
     $ sudo apt autoremove
     $ sudo apt autoclean
     $ sudo rm -rf /usr/local/cuda*
     $ sudo reboot

Once rebooted, I made sure that no NVIDIA drivers were found ($ nvidia-smi -> command not found).
I installed the NVIDIA drivers and the CUDA toolkit as described on the official NVIDIA website:

    $ sudo apt install nvidia-driver firmware-misc-nonfree
    $ sudo reboot
    $ sudo apt install nvidia-cuda-dev nvidia-cuda-toolkit
    $ sudo reboot
    $ sudo apt install cuda-toolkit-12-5
    $ sudo reboot

Now I have the following system:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|

with the following setup in my .bashrc

export PATH=/usr/local/cuda-12.5/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-12.5/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

In a new conda environment I tried to install torch with CUDA support again, but no luck. CUDA is not recognized:

Python 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.4.0.dev20240525'
>>> torch.cuda.is_available()
False

wilhelm · May 26, 2024, 1:21pm

Is there any update about the 12.5 version?
Should we keep installing the 12.4 version?
Or should the installation procedure for 12.4 also work with CUDA 12.5?

ptrblck · May 26, 2024, 5:39pm

Your locally installed CUDA toolkit won’t be used since the PyTorch binaries ship with their own CUDA runtime dependencies unless you build PyTorch from source or a custom CUDA extension.
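
To see which CUDA runtime a given PyTorch binary was built against, independent of any toolkit under /usr/local, one can query `torch.version.cuda`. A minimal sketch; the function name `describe_torch_build` is hypothetical, and the import is guarded so the script also runs where PyTorch is not installed:

```python
import importlib.util


def describe_torch_build():
    """Report how the installed PyTorch binary was built (hypothetical helper)."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}

    import torch

    return {
        "installed": True,
        "torch_version": torch.__version__,
        # CUDA runtime the binary was built with; None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        # True only if a usable GPU and driver are found at run time.
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` prints None, the binary is a CPU-only build, and no driver or toolkit change will make `torch.cuda.is_available()` return True.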

wilhelm · May 26, 2024, 5:46pm

Well… then, after 2 days of trying, I can say that the following instruction:

conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch-nightly -c nvidia

for some reason is not working anymore (no CUDA found).
I had no problems before.

ptrblck · May 27, 2024, 1:07pm

I’ve created a new conda env and it still works for me:

conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch-nightly -c nvidia
...
## Package Plan ##

...

  added / updated specs:
    - pytorch
    - pytorch-cuda=12.4
    - torchaudio
    - torchvision
...
The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
...
    cuda-cudart-12.4.127       |                0         198 KB  nvidia
    cuda-cupti-12.4.127        |                0        16.4 MB  nvidia
    cuda-libraries-12.4.0      |                0           2 KB  nvidia
    cuda-nvrtc-12.4.127        |                0        21.0 MB  nvidia
    cuda-nvtx-12.4.127         |                0          58 KB  nvidia
    cuda-opencl-12.4.127       |                0          12 KB  nvidia
    cuda-runtime-12.4.0        |                0           2 KB  nvidia
...
    pytorch-2.4.0.dev20240527  |py3.10_cuda12.4_cudnn8.9.2_0        1.38 GB  pytorch-nightly
    pytorch-cuda-12.4          |       hc786d27_6           7 KB  pytorch-nightly
    pyyaml-6.0.1               |  py310h2372a71_1         167 KB  conda-forge
    requests-2.32.2            |     pyhd8ed1ab_0          57 KB  conda-forge
    sympy-1.12                 | pypyh9d50eac_103         4.1 MB  conda-forge
    tbb-2021.12.0              |       h297d8ca_1         190 KB  conda-forge
    torchaudio-2.2.0.dev20240527|      py310_cu124         6.2 MB  pytorch-nightly
    torchtriton-3.0.0+45fff310c8|            py310       250.5 MB  pytorch-nightly
    torchvision-0.19.0.dev20240527|      py310_cu124         8.3 MB  pytorch-nightly
...
python -c "import torch; print(torch.__version__); print(torch.version.cuda); print(torch.randn(1).cuda())"
2.4.0.dev20240527
12.4
tensor([1.2858], device='cuda:0')

wilhelm · May 27, 2024, 5:46pm

Thanks,

but may I ask which NVIDIA driver version you have installed?

No luck in my case:

## Package Plan ##

  added / updated specs:
    - pytorch
    - pytorch-cuda=12.4
    - torchaudio
    - torchvision

The following NEW packages will be INSTALLED:

  cuda-cupti         nvidia/linux-64::cuda-cupti-12.4.127-0 
  cuda-libraries     nvidia/linux-64::cuda-libraries-12.4.0-0 
  cuda-nvtx          nvidia/linux-64::cuda-nvtx-12.4.127-0 
  cuda-opencl        nvidia/linux-64::cuda-opencl-12.4.127-0 
  cuda-runtime       nvidia/linux-64::cuda-runtime-12.4.0-0 
  ffmpeg             conda-forge/linux-64::ffmpeg-4.4.0-h6987444_4 
  gmp                conda-forge/linux-64::gmp-6.3.0-h59595ed_1 
  gnutls             pkgs/main/linux-64::gnutls-3.6.15-he1e5248_0 
  libcufft           nvidia/linux-64::libcufft-11.2.0.44-0 
  libcufile          nvidia/linux-64::libcufile-1.9.1.3-0 
  libcurand          nvidia/linux-64::libcurand-10.3.5.147-0 
  libcusolver        nvidia/linux-64::libcusolver-11.6.0.99-0 
  libidn2            conda-forge/linux-64::libidn2-2.3.7-hd590300_0 
  libnpp             nvidia/linux-64::libnpp-12.2.5.2-0 
  libnvfatbin        nvidia/linux-64::libnvfatbin-12.4.127-0 
  libnvjpeg          nvidia/linux-64::libnvjpeg-12.3.1.89-0 
  libunistring       pkgs/main/linux-64::libunistring-0.9.10-h27cfd23_0 
  libvpx             pkgs/main/linux-64::libvpx-1.11.0-h295c915_0 
  llvm-openmp        conda-forge/linux-64::llvm-openmp-15.0.7-h0cdce71_0 
  mpmath             pkgs/main/linux-64::mpmath-1.3.0-py312h06a4308_0 
  nettle             pkgs/main/linux-64::nettle-3.7.3-hbbd107a_1 
  networkx           conda-forge/noarch::networkx-3.3-pyhd8ed1ab_1 
  openh264           pkgs/main/linux-64::openh264-2.1.1-h4ff587b_0 
  pytorch            pytorch-nightly/linux-64::pytorch-2.4.0.dev20240527-py3.12_cpu_0 
  pytorch-cuda       pytorch-nightly/linux-64::pytorch-cuda-12.4-hc786d27_6 
  pytorch-mutex      pytorch-nightly/noarch::pytorch-mutex-1.0-cpu 
  sympy              pkgs/main/linux-64::sympy-1.12-py312h06a4308_0 
  torchaudio         pytorch-nightly/linux-64::torchaudio-2.2.0.dev20240527-py312_cpu 
  torchvision        pytorch-nightly/linux-64::torchvision-0.19.0.dev20240527-py312_cpu 
  x264               conda-forge/linux-64::x264-1!161.3030-h7f98852_1 
  x265               conda-forge/linux-64::x265-3.5-h924138e_3 

Proceed ([y]/n)? y

Downloading and Extracting Packages:
...

which results in:

Python 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False
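
In the package plan above, the pytorch, torchaudio and torchvision build strings end in `cpu` and pytorch-mutex resolves to its `cpu` variant, whereas the plan in the earlier reply shows `py3.10_cuda12.4_cudnn8.9.2_0`. A CPU-only build would report no CUDA regardless of the driver. The following sketch classifies a conda build string; the function name `classify_build` and the regular expressions are assumptions made for illustration, based only on the build strings quoted in this thread:

```python
import re


def classify_build(build):
    """Classify a conda build string as a CUDA or CPU build.

    Hypothetical helper; the patterns only cover the build strings
    quoted in this thread, e.g. 'py3.10_cuda12.4_cudnn8.9.2_0',
    'py310_cu124' and 'py3.12_cpu_0'.
    """
    # Long form, e.g. "_cuda12.4".
    match = re.search(r"_cuda(\d+)\.(\d+)", build)
    if match:
        return f"cuda {match.group(1)}.{match.group(2)}"
    # Short form, e.g. "_cu124" meaning CUDA 12.4.
    match = re.search(r"_cu(\d{2})(\d)(?:_|$)", build)
    if match:
        return f"cuda {match.group(1)}.{match.group(2)}"
    if re.search(r"_cpu(?:_|$)", build):
        return "cpu"
    return "unknown"


if __name__ == "__main__":
    for build in ("py3.10_cuda12.4_cudnn8.9.2_0", "py3.12_cpu_0", "py312_cpu"):
        print(build, "->", classify_build(build))
```

The build string of an already installed package appears in the Build column of `conda list pytorch`.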

ptrblck · May 28, 2024, 3:41pm

I currently have 530.30.02 on the system I used.

jesmine · May 30, 2024, 2:43am

RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR

jesmine · May 30, 2024, 2:44am

C:\Users\hp\anaconda3\python.exe “D:\CT_Image\Madical Image Process\Codes\Github\Experimental codes\EDCNN\train\train_denoise.py”
[‘D:\CT_Image\Madical Image Process\Codes\Github\Experimental codes\EDCNN\train’, ‘D:\CT_Image\Madical Image Process\Codes\Github\Experimental codes\EDCNN’, ‘C:\Users\hp\anaconda3\python311.zip’, ‘C:\Users\hp\anaconda3\DLLs’, ‘C:\Users\hp\anaconda3\Lib’, ‘C:\Users\hp\anaconda3’, ‘C:\Users\hp\anaconda3\Lib\site-packages’, ‘C:\Users\hp\anaconda3\Lib\site-packages\win32’, ‘C:\Users\hp\anaconda3\Lib\site-packages\win32\lib’, ‘C:\Users\hp\anaconda3\Lib\site-packages\Pythonwin’, ‘D:\CT_Image\Madical Image Process\Codes\Github\Experimental codes\EDCNN\train\…/dataset/’, ‘D:\CT_Image\Madical Image Process\Codes\Github\Experimental codes\EDCNN\train\…’]
D:\CT_Image\Madical Image Process\Codes\Github\Experimental codes\EDCNN\train
Namespace(batch_size=8, nepoch=100, train_workers=0, eval_workers=0, dataset=‘AAPM’, pretrain_weights=‘./logs/denoising/AAPM/EDCNNm/models/model_latest.pth’, optimizer=‘adamw’, lr_initial=0.0002, step_lr=50, weight_decay=0.0001, gpu=‘0’, arch=‘EDCNN’, mode=‘denoising’, dd_in=1, save_dir=‘./logs/’, save_images=False, env=‘’, checkpoint=10, decay_lrs=4, norm_layer=‘nn.LayerNorm’, embed_dim=32, win_size=8, token_projection=‘linear’, token_mlp=‘leff’, att_se=False, modulator=False, vit_dim=256, vit_depth=12, vit_nheads=8, vit_mlp_dim=512, vit_patch_size=16, global_skip=False, local_skip=False, vit_share=False, train_ps=64, val_ps=128, resume=False, train_dir=‘./datasets/AAPM/train’, val_dir=‘./datasets/AAPM/val’, test_dir=‘./datasets/AAPM/test’, warmup=True, warmup_epochs=3, result_dir=‘./logs/denoising/AAPM/Uformer_S_0815_LeFF/results/’, local_rank=-1, distribute=False, distribute_mode=‘DDP’)
Now time is : 2024-05-30T10:19:48.269591
Using warmup and cosine strategy!
C:\Users\hp\anaconda3\Lib\site-packages\torch\optim\lr_scheduler.py:143: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at torch.optim — PyTorch 2.3 documentation
warnings.warn("Detected call of lr_scheduler.step() before optimizer.step(). "
C:\Users\hp\anaconda3\Lib\site-packages\torchvision\models\_utils.py:135: UserWarning: Using 'weights' and 'progress' as positional parameter(s) is deprecated since 0.13 and may be removed in the future. Please use keyword parameter(s) instead.
warnings.warn(
C:\Users\hp\anaconda3\Lib\site-packages\torchvision\models\_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=ResNet50_Weights.IMAGENET1K_V1. You can also use weights=ResNet50_Weights.DEFAULT to get the most up-to-date weights.
warnings.warn(msg)
===> Loading datasets
Sizeof training set: 2039 , sizeof validation set: 128
===> Start Epoch 1 End Epoch 100

Evaluation after every 255 Iterations !!!

60%|██████ | 153/255 [00:15<00:10, 9.96it/s]
Traceback (most recent call last):
File "D:\CT_Image\Madical Image Process\Codes\Github\Experimental codes\EDCNN\train\train_denoise.py", line 198, in <module>
loss.backward()
File "C:\Users\hp\anaconda3\Lib\site-packages\torch\_tensor.py", line 525, in backward
torch.autograd.backward(
File "C:\Users\hp\anaconda3\Lib\site-packages\torch\autograd\__init__.py", line 267, in backward
_engine_run_backward(
File "C:\Users\hp\anaconda3\Lib\site-packages\torch\autograd\graph.py", line 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR

Process finished with exit code 1
Can you please advise what’s wrong here?

wilhelm · May 30, 2024, 6:21am

Your error has nothing to do with this thread.
If you had searched the forum, you would have found this.

renatoseb · May 31, 2024, 4:23am

I have the same problem, but with this output.
nvidia-smi:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0

☁  ~  conda install pytorch torchvision torchaudio pytorch-cuda=12.5 -c pytorch-nightly -c nvidia

Channels:
 - pytorch-nightly
 - nvidia
 - defaults
 - pytorch
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  - pytorch-cuda=12.5*

Current channels:

  - https://conda.anaconda.org/pytorch-nightly
  - https://conda.anaconda.org/nvidia
  - defaults
  - https://conda.anaconda.org/pytorch/linux-64
  - https://conda.anaconda.org/pytorch/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

In python with conda:

CUDA available: False
CUDA version: None
Device count: 0
Traceback (most recent call last):
  File "/home/renatoseb/test_cuda.py", line 5, in <module>
    print("Current device:", torch.cuda.current_device())
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/renatoseb/anaconda3/lib/python3.11/site-packages/torch/cuda/__init__.py", line 778, in current_device
    _lazy_init()
  File "/home/renatoseb/anaconda3/lib/python3.11/site-packages/torch/cuda/__init__.py", line 284, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

wilhelm · May 31, 2024, 5:34am

It is because of the latest NVIDIA drivers.
Sadly, we can only hope that an update will be released soon.

ptrblck · May 31, 2024, 12:52pm

You are trying to install an invalid package. Visit the install instructions on the PyTorch website, select a valid CUDA version and copy/paste the command into your terminal.

It’s not because of the drivers, since the install command is invalid, as described.
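
The `CUDA Version: 12.5` field in the nvidia-smi header reports the newest CUDA runtime the driver supports; it is not the name of a PyTorch package, which is why `pytorch-cuda=12.5` is not found. A minimal sketch of that comparison, assuming the header format shown in this thread; `driver_cuda_version` and `driver_can_run` are hypothetical helper names:

```python
import re


def driver_cuda_version(smi_header):
    """Extract (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_header)
    if match is None:
        raise ValueError("no 'CUDA Version' field found")
    return int(match.group(1)), int(match.group(2))


def driver_can_run(smi_header, runtime):
    """Return True if the driver's supported CUDA version is >= the runtime's.

    `runtime` is the version of the CUDA runtime shipped with the PyTorch
    binary, e.g. "12.4". Simplification: this ignores CUDA's
    minor-version and forward-compatibility modes.
    """
    major, minor = (int(part) for part in runtime.split("."))
    return driver_cuda_version(smi_header) >= (major, minor)


# Header line as posted in this thread, with the padding shortened.
HEADER = "| NVIDIA-SMI 555.42.02    Driver Version: 555.42.02    CUDA Version: 12.5 |"

if __name__ == "__main__":
    print(driver_cuda_version(HEADER))
    print(driver_can_run(HEADER, "12.4"))
```

By this rule a 12.5-capable driver is expected to run a 12.4 build, so a False from `torch.cuda.is_available()` would point at something else, such as a CPU-only build or a broken driver installation.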

ali0une · June 1, 2024, 11:25am

I had the same problem on Debian 12 and had to revert the NVIDIA drivers to 550.54.
As far as I know, version 555 is not compatible with the current PyTorch packages.

First uninstall:

sudo apt remove nvidia-*
sudo apt autoremove

Edit your /etc/apt/preferences file:
sudo nano /etc/apt/preferences

and add:

Package: *
Pin: release o=NVIDIA,l=NVIDIA CUDA
Pin-Priority: 996

Package: /nvidia/ /cuda/ /nvcuvid/ /nvctrl/
Pin: version 550.54.*
Pin-Priority: 1000

Update apt:
sudo apt update

Result of apt policy nvidia-driver nvidia-cuda-dev nvidia-cuda-toolkit nvidia-cuda-toolkit-gcc:

nvidia-driver:
  Installed: 550.54.15-1
  Candidate: 550.54.15-1
  Version table:
     555.42.02-1 996
        996 https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64  Packages
 *** 550.54.15-1 1000
        996 https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64  Packages
        100 /var/lib/dpkg/status
     550.54.14-1 1000
        996 https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64  Packages
     545.23.08-1 996
        996 https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64  Packages
     545.23.06-1 996
        996 https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64  Packages
     525.147.05-7~deb12u1 995
        995 https://deb.debian.org/debian bookworm-updates/non-free amd64 Packages
     525.147.05-4~deb12u1 990
        990 https://deb.debian.org/debian bookworm/non-free amd64 Packages
nvidia-cuda-dev:
  Installed: 11.8.89~11.8.0-5~deb12u1
  Candidate: 11.8.89~11.8.0-5~deb12u1
  Version table:
 *** 11.8.89~11.8.0-5~deb12u1 990
        990 https://deb.debian.org/debian bookworm/non-free amd64 Packages
        100 /var/lib/dpkg/status
nvidia-cuda-toolkit:
  Installed: 11.8.89~11.8.0-5~deb12u1
  Candidate: 11.8.89~11.8.0-5~deb12u1
  Version table:
 *** 11.8.89~11.8.0-5~deb12u1 990
        990 https://deb.debian.org/debian bookworm/non-free amd64 Packages
        100 /var/lib/dpkg/status
nvidia-cuda-toolkit-gcc:
  Installed: 11.8.0-5~deb12u1
  Candidate: 11.8.0-5~deb12u1
  Version table:
 *** 11.8.0-5~deb12u1 990
        990 https://deb.debian.org/debian bookworm/non-free amd64 Packages
        100 /var/lib/dpkg/status
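
The nvidia-driver version table above illustrates how the pin takes effect: the 550.54.* entries get priority 1000, so apt prefers them over 555.42.02-1 at 996. A simplified sketch of that selection, assuming the candidate is the highest-priority entry with ties broken by the higher version; real Debian version comparison is more involved, and `pick_candidate` is a hypothetical helper:

```python
import re


def version_key(version):
    """Turn '550.54.15-1' into (550, 54, 15, 1) for a rough numeric comparison."""
    return tuple(int(number) for number in re.findall(r"\d+", version))


def pick_candidate(table):
    """Pick the entry with the highest pin priority, then the highest version.

    `table` is a list of (version, priority) pairs as shown by `apt policy`.
    """
    version, _priority = max(table, key=lambda entry: (entry[1], version_key(entry[0])))
    return version


# Entries copied from the nvidia-driver version table above.
NVIDIA_DRIVER_TABLE = [
    ("555.42.02-1", 996),
    ("550.54.15-1", 1000),
    ("550.54.14-1", 1000),
    ("545.23.08-1", 996),
    ("545.23.06-1", 996),
    ("525.147.05-7~deb12u1", 995),
    ("525.147.05-4~deb12u1", 990),
]

if __name__ == "__main__":
    print(pick_candidate(NVIDIA_DRIVER_TABLE))
```

With every priority set to the same value, the same function would return 555.42.02-1, which is the upgrade the pin is meant to block.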

Install with apt:

sudo apt install nvidia-driver nvidia-cuda-dev nvidia-cuda-toolkit nvidia-cuda-toolkit-gcc

OR
install with aptitude if apt complains about dependencies. Don’t be afraid of removing packages like libcuda; everything will be set up correctly afterwards. Say no to the first proposal and select the second:

sudo aptitude install nvidia-driver nvidia-cuda-dev nvidia-cuda-toolkit nvidia-cuda-toolkit-gcc

Reboot, profit!

Everything is working fine now. Lesson learned: “if it ain’t broke, don’t fix it”.

Edit: fixed some typos, added clarifications.