CUDA driver initialization failed: torch.cuda.is_available() returns False

I am facing an issue with CUDA initialization on my EC2 instance running Amazon Linux. The instance has an NVIDIA A10G GPU, and the output of nvidia-smi and nvcc suggests that the GPU driver and CUDA toolkit are installed correctly. However, when I try to access the GPU from PyTorch, I get an error saying that the CUDA driver initialization failed.

CUDA and Driver Info:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0

NVIDIA-SMI:

$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+   
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
|-----------------------------------------+------------------------+----------------------+  
|   0  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
|  0%   23C    P0             56W /  300W |       1MiB /  23028MiB |      5%      Default |
+-----------------------------------------+------------------------+----------------------+

PyTorch Script:

import torch
print(torch.__version__)  # 2.5.1+cu124
print(torch.cuda.is_available())  # False
print(torch.cuda.device_count())  # 1
print(torch.cuda.get_device_name(0))  # Error: CUDA driver initialization failed

Output:

False
Traceback (most recent call last):
  File "/home/ec2-user/LLM/test_cuda.py", line 5, in <module>
    print(torch.cuda.get_device_name(0))
  File "/home/ec2-user/py39/lib64/python3.9/site-packages/torch/cuda/__init__.py", line 493, in get_device_name
    return get_device_properties(device).name
  File "/home/ec2-user/py39/lib64/python3.9/site-packages/torch/cuda/__init__.py", line 523, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/home/ec2-user/py39/lib64/python3.9/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.

Is there a conflict between the installed CUDA version (12.6) and the PyTorch build (2.5.1+cu124)? Why did the CUDA driver initialization fail?


Not an expert, just a suggestion. I think torch expects you to install it using the installation matrix, so it should work, but go with the latest CUDA version listed in the matrix when you install it.

See this
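
On the version question: as far as I know, the PyTorch pip wheels bundle their own CUDA runtime, so the toolkit version reported by nvcc (12.6) is not what matters for running torch. What matters is that the driver supports a CUDA version at least as new as the one the wheel was built for. Here is a minimal sketch of that comparison using the values from the post. The helper names `parse_cuda_tag` and `driver_supports` are made up for illustration, and the parsing assumes the last digit of the `+cuXYZ` tag is the minor version.

```python
import re


def parse_cuda_tag(torch_version):
    """Return (major, minor) from a '+cuXYZ' wheel suffix, or None if absent."""
    match = re.search(r"\+cu(\d+)$", torch_version)
    if match is None:
        return None  # e.g. a CPU-only build such as '2.5.1+cpu'
    digits = match.group(1)
    # Assumption: the last digit is the minor version, e.g. 'cu124' -> (12, 4).
    return int(digits[:-1]), int(digits[-1])


def driver_supports(driver_cuda, wheel_cuda):
    """True if the driver's maximum CUDA version covers the wheel's CUDA version."""
    return driver_cuda >= wheel_cuda


wheel = parse_cuda_tag("2.5.1+cu124")  # torch.__version__ from the post
driver = (12, 7)                       # "CUDA Version: 12.7" from nvidia-smi
print(wheel, driver_supports(driver, wheel))  # (12, 4) True
```

With the driver reporting 12.7 and the wheel targeting 12.4, a plain version mismatch does not look like the cause here.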

The issue is resolved. It seems to be related to Amazon Linux requiring a special version of the GPU driver.
I followed the instructions provided in the AWS documentation:

After downloading and installing the driver as described, everything is working fine now, and I can run torch without any issues.
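
For anyone who wants to confirm the fix after reinstalling the driver, something like the following could be used as a quick check. It is only a sketch: the function name `cuda_report` is hypothetical, and it touches the GPU only if `torch.cuda.is_available()` is true, so it should not raise the initialization error shown above.

```python
def cuda_report():
    """Collect basic CUDA facts from PyTorch without forcing GPU initialization."""
    try:
        import torch
    except ImportError:
        return {"torch": None}  # PyTorch is not installed in this environment

    report = {
        "torch": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None on CPU-only builds
        "available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }
    if report["available"]:
        # Only query the device and run a tiny computation once CUDA initializes.
        report["device_name"] = torch.cuda.get_device_name(0)
        report["sum_on_gpu"] = float(torch.ones(2, device="cuda").sum().item())
    return report


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

If `available` is still False after the driver install, the remaining fields at least show which CUDA version the installed wheel was built for.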