CUDA driver initialization failed; torch.cuda.is_available() returns False

open2smu · December 2, 2024, 9:17am

I am facing a CUDA initialization issue on an EC2 instance running Amazon Linux. The instance has an NVIDIA A10G GPU, and the nvidia-smi and nvcc commands both report that the GPU driver and the CUDA toolkit are installed. However, when I try to access the GPU from PyTorch, I get an error message saying that CUDA driver initialization failed.

CUDA and Driver Info:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0

NVIDIA-SMI:

$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+   
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
|-----------------------------------------+------------------------+----------------------+  
|   0  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
|  0%   23C    P0             56W /  300W |       1MiB /  23028MiB |      5%      Default |
+-----------------------------------------+------------------------+----------------------+

PyTorch Script:

import torch
print(torch.__version__)  # 2.5.1+cu124
print(torch.cuda.is_available())  # False
print(torch.cuda.device_count())  # 1
print(torch.cuda.get_device_name(0))  # Error: CUDA driver initialization failed

Output:

False
Traceback (most recent call last):
  File "/home/ec2-user/LLM/test_cuda.py", line 5, in <module>
    print(torch.cuda.get_device_name(0))
  File "/home/ec2-user/py39/lib64/python3.9/site-packages/torch/cuda/__init__.py", line 493, in get_device_name
    return get_device_properties(device).name
  File "/home/ec2-user/py39/lib64/python3.9/site-packages/torch/cuda/__init__.py", line 523, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/home/ec2-user/py39/lib64/python3.9/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.

Is there a conflict between the installed CUDA toolkit version (12.6) and the PyTorch build (2.5.1+cu124)? Why did the CUDA driver initialization fail?
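
For reference, here is a minimal sketch of how the version numbers above relate. It assumes that the "CUDA Version" field printed by nvidia-smi is the highest CUDA version the driver supports, and that the pip wheel ships its own CUDA runtime, so the nvcc toolkit version (12.6) is not what the wheel uses. The helper names are hypothetical, and the `+cuXYZ` parsing assumes a single-digit minor version.

```python
import re


def driver_cuda_version(smi_header):
    """Extract the 'CUDA Version' field from an nvidia-smi header line."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_header)
    if not match:
        return None
    return (int(match.group(1)), int(match.group(2)))


def wheel_cuda_version(torch_version):
    """Extract the CUDA runtime from a local version tag such as '+cu124'.

    Assumption: the last digit is the minor version, the rest is the major.
    """
    match = re.search(r"\+cu(\d+)(\d)$", torch_version)
    if not match:
        return None
    return (int(match.group(1)), int(match.group(2)))


# Values copied from the nvidia-smi output and the script in this post.
smi_header = "| NVIDIA-SMI 565.57.01   Driver Version: 565.57.01   CUDA Version: 12.7 |"
torch_version = "2.5.1+cu124"

driver = driver_cuda_version(smi_header)
wheel = wheel_cuda_version(torch_version)

# The driver must support at least the CUDA version the wheel was built for.
compatible = driver is not None and wheel is not None and driver >= wheel
print("driver supports up to:", driver)
print("wheel built for:", wheel)
print("versions compatible on paper:", compatible)
```

With these values the driver (12.7) is newer than what the cu124 wheel needs (12.4), which suggests the version numbers alone do not explain the failure.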

Aknw_Fen · December 2, 2024, 11:42am

Not an expert, just a suggestion. I think PyTorch expects you to install it with the command from the installation matrix on its website. Once installed that way it should work, but pick the latest CUDA build listed in the matrix.

See this

open2smu · December 3, 2024, 6:20am

The issue is resolved. It seems to be related to Amazon Linux requiring a special version of the GPU driver.
I followed the instructions provided in the AWS documentation:

After downloading and installing the driver as described, everything is working fine now, and I can run torch without any issues.
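
A minimal sketch for re-checking the setup after a driver change, assuming only the standard `torch.cuda` calls already used in this thread plus `torch.cuda.init()` and `torch.version.cuda`. The function name `cuda_report` is hypothetical. It collects the results in a dict instead of crashing, so the underlying error message is visible even when initialization fails.

```python
def cuda_report():
    """Collect basic CUDA facts without crashing if initialization fails."""
    report = {"torch_installed": False, "available": False, "error": None}
    try:
        import torch
    except ImportError as exc:
        report["error"] = f"torch import failed: {exc}"
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["built_for_cuda"] = torch.version.cuda  # None on CPU-only builds
    report["available"] = torch.cuda.is_available()

    try:
        # Force initialization so a driver problem surfaces here.
        torch.cuda.init()
        report["device_count"] = torch.cuda.device_count()
        if report["device_count"] > 0:
            report["device_name"] = torch.cuda.get_device_name(0)
    except Exception as exc:  # e.g. RuntimeError from a failed driver init
        report["error"] = str(exc)
    return report


report = cuda_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

On a working setup, `available` should be True and `error` should be None; on a broken one, `error` holds the message raised during initialization.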