CUDA not available

I am aware variations of this have been asked multiple times, but even after working through many of those, I’m still stuck. I’m trying to get PyTorch with CUDA support running on my laptop. However, torch.cuda.is_available() returns False. Selected system information and diagnostic outputs are as follows:

Lenovo ThinkPad P14S Gen4
NVIDIA RTX A500 Laptop GPU
Linux Kernel 6.11.11-1
NVIDIA Driver Version: 550.135

nvidia-smi output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.135                Driver Version: 550.135        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A500 Laptop GPU     Off |   00000000:03:00.0 Off |                  N/A |
| N/A   42C    P0              7W /   30W |       8MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0

torch.utils.collect_env:

PyTorch version: 2.5.1
Is debug build: False
CUDA used to build PyTorch: 12.6
ROCM used to build PyTorch: N/A

OS: Manjaro Linux (x86_64)
GCC version: (GCC) 14.2.1 20240910
Clang version: 18.1.8
CMake version: version 3.31.2
Libc version: glibc-2.40

Python version: 3.12.7 (main, Oct  1 2024, 11:15:50) [GCC 14.2.1 20240910] (64-bit runtime)
Python platform: Linux-6.11.11-1-MANJARO-x86_64-with-glibc2.40
Is CUDA available: False
CUDA runtime version: 12.6.85
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: NVIDIA RTX A500 Laptop GPU
Nvidia driver version: 550.135
cuDNN version: Probably one of the following:
/usr/lib/libcudnn.so.9.5.1
/usr/lib/libcudnn_adv.so.9.5.1
/usr/lib/libcudnn_cnn.so.9.5.1
/usr/lib/libcudnn_engines_precompiled.so.9.5.1
/usr/lib/libcudnn_engines_runtime_compiled.so.9.5.1
/usr/lib/libcudnn_graph.so.9.5.1
/usr/lib/libcudnn_heuristic.so.9.5.1
/usr/lib/libcudnn_ops.so.9.5.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

I’ve also tried both a venv and a conda env running PyTorch 2.5.1 compiled against CUDA 12.4, with basically the same result.
Not that it should make any difference, but CUDA is in both my PATH and my LD_LIBRARY_PATH.

As far as I understand the whole setup, the versions should match, and I really don’t understand what’s going wrong. Please let me know if you need any additional information!
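
For completeness, the same information can be checked by hand with a few standard PyTorch calls:

import torch

# Quick manual version/availability check.
print(torch.__version__)          # 2.5.1
print(torch.version.cuda)         # CUDA version the binary was built against: 12.6
print(torch.cuda.is_available())  # False here, despite the versions lining up
print(torch.cuda.device_count())  # 0 when PyTorch can’t see the GPU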

Edit: I’ve also posted this question to Stack Overflow (“pytorch - CUDA not available”) and will comment on any solutions found there.

I just found out torch.cuda.is_available() returns True in a Docker container based on pytorch/pytorch:2.1.2-cuda12.1-cudnn8-devel.

For the moment I can simply work within this container, but I would still prefer to also have CUDA available in my host system’s PyTorch.

It’s unclear what the issue is, as you didn’t post any error message.

The answer from your cross-post is also wrong:

Maybe not the reason for the overall issue, but you should update your driver. Currently it only supports up to CUDA 12.4, so CUDA 12.6 is incompatible.

since you don’t need to update the NVIDIA driver for newer minor CUDA versions due to Minor Version Compatibility.

Your locally installed CUDA toolkit won’t be used, since the PyTorch binaries ship with their own CUDA runtime dependencies.
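
A quick way to convince yourself of this is to compare the runtime the binary ships with against the local toolkit; the two can differ without breaking anything. A minimal sketch:

import subprocess
import torch

# CUDA runtime the PyTorch binary was built against and ships with:
print("torch.version.cuda:", torch.version.cuda)

# Locally installed toolkit; it is only used when compiling CUDA extensions,
# so a mismatch with the version above is harmless for stock PyTorch binaries.
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)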

Hi, thanks for the reply! I was hoping you, who appears to solve all of these problems, would show up. My follow-up is even more embarrassing: all of a sudden, torch.cuda.is_available() returns True. And now that it does, it’s working on the host, in both the venv and the conda env, and also in another conda env using CUDA 12.1. Unfortunately for anyone ending up here via a search, I haven’t got the slightest idea what changed in the meantime. So, sorry everyone for bothering you, and thanks for the reply anyway!

Hi Mo!

I have a laptop with a gpu similar to yours – ThinkPad P16v Gen 2, with
the NVIDIA RTX 3000 Ada Generation Laptop GPU, running Ubuntu 22.04.1 LTS
(not Manjaro).

Note that the gpu driver – at least as used by cuda / pytorch – has some
issues with ubuntu’s (and maybe linux’s, in general) power management,
seemingly not restarting properly after a suspend. You might try running
pytorch with cuda immediately after rebooting or try @ptrblck’s recipe
for bouncing cuda:

sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm
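
One caveat, and a minimal check (to be run in a fresh Python process): a process that already hit the cuda error keeps its broken context, so verify after reloading the module like this:

# Run in a *fresh* Python process after reloading nvidia_uvm; a process
# that already hit the cuda error keeps its broken context.
import torch

t = torch.ones(3, device='cuda')  # raises a RuntimeError if cuda is still down
print(t + 1)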

Best.

K. Frank

I’ll give that a try tomorrow morning and report back! If only Linux, in the year 2025, had mastered suspend hooks… Thanks for the suggestion anyway; if that’s the problem, I’d never have figured it out by myself!

So, directly after waking up from suspend this morning, CUDA is still available. My next best guess would be that something’s going on with the docking station. For now I’ll mark @KFrank’s answer as the best available guess at a solution.

Hi Mo!

Here is my best recipe for reproducing the bug.

First, test for cuda by creating (and maybe using) a tensor on the gpu.
Sometimes torch.cuda.is_available() returns True for me even when
cuda has quit working:

>>> import torch
>>> t = torch.randn (5, device = 'cuda')
>>> t
tensor([ 2.0058, -0.0592,  1.4067, -1.8612, -0.4359], device='cuda:0')
<*** after a lid-close suspend ***>
>>> torch.cuda.is_available()
True
>>> t
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<path_to_pytorch_install>/site-packages/torch/_tensor.py", line 523, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<path_to_pytorch_install>/site-packages/torch/_tensor_str.py", line 708, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<path_to_pytorch_install>/torch/_tensor_str.py", line 625, in _str_intern
    tensor_str = _tensor_str(self, indent)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<path_to_pytorch_install>/torch/_tensor_str.py", line 357, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<path_to_pytorch_install>/torch/_tensor_str.py", line 145, in __init__
    nonzero_finite_vals = torch.masked_select(
                          ^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Recipe to reproduce:

1. Run python / pytorch and use cuda, for example, by creating a cuda
tensor. (Check that your python process running pytorch / cuda shows
up in nvidia-smi.) The suspend driver bug doesn’t seem to occur unless
cuda is already in use.

2. Close the laptop lid to trigger a suspend.

3. Come out of suspend.

4. Access / perform a computation with the cuda tensor. At this point, I get
a cuda error. Note, torch.cuda.is_available() doesn’t necessarily
return False for me even though cuda isn’t working; see the sketch just
after this list.
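
For reference, here is a small helper along these lines (just a sketch; the name cuda_really_works is mine) that performs a real computation instead of trusting is_available():

import torch

def cuda_really_works() -> bool:
    """Allocate and use a gpu tensor; is_available() alone can keep
    returning True after a bad suspend, so force a kernel launch."""
    if not torch.cuda.is_available():
        return False
    try:
        t = torch.randn(5, device='cuda')
        _ = t.sum().item()  # .item() synchronizes, surfacing launch failures
        return True
    except RuntimeError:
        return False

print(cuda_really_works())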

Further color:

Killing and restarting the python process (so that you no longer see the
python process in nvidia-smi) doesn’t fix cuda for me. I can fix cuda
(without a reboot) by killing the python process (and sometimes also
nvidia-smi) to free the nvidia_uvm module, and then running:

sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm

Note, I am under the impression (I haven’t tested this carefully) that
the gpu is still working in the sense that I could run various graphics
(e.g., gaming). It’s just that cuda is broken.

Good luck!

K. Frank