I am facing an issue with CUDA initialization on my EC2 instance running Amazon Linux. My environment has an NVIDIA A10G GPU, and the nvidia-smi and nvcc commands show that the GPU and CUDA are correctly installed. However, when I try to use PyTorch to access the GPU, I get an error message saying that the CUDA driver initialization failed.
CUDA and Driver Info:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0
NVIDIA-SMI:
$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
|-----------------------------------------+------------------------+----------------------+
| 0 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 |
| 0% 23C P0 56W / 300W | 1MiB / 23028MiB | 5% Default |
+-----------------------------------------+------------------------+----------------------+
PyTorch Script:
import torch
print(torch.__version__) # 2.5.1+cu124
print(torch.cuda.is_available()) # False
print(torch.cuda.device_count()) # 1
print(torch.cuda.get_device_name(0)) # Error: CUDA driver initialization failed
Output:
False
Traceback (most recent call last):
File "/home/ec2-user/LLM/test_cuda.py", line 5, in <module>
print(torch.cuda.get_device_name(0))
File "/home/ec2-user/py39/lib64/python3.9/site-packages/torch/cuda/__init__.py", line 493, in get_device_name
return get_device_properties(device).name
File "/home/ec2-user/py39/lib64/python3.9/site-packages/torch/cuda/__init__.py", line 523, in get_device_properties
_lazy_init() # will define _get_device_properties
File "/home/ec2-user/py39/lib64/python3.9/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init
torch._C._cuda_init()
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.
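To tell a driver-level failure apart from a PyTorch-level one, one check that could be run is to call the CUDA driver API directly, bypassing PyTorch. The following is only a sketch under assumptions: the helper name probe_cuda_driver is made up for illustration, and it assumes the driver library is installed as libcuda.so.1 on the loader path. It uses the driver API functions cuInit and cuGetErrorString through ctypes.

```python
import ctypes


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Call cuInit(0) directly and return a (status, detail) tuple.

    probe_cuda_driver is a hypothetical helper for illustration; the
    default library name is an assumption about the local install.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError as exc:
        # The driver library itself could not be loaded.
        return ("load_failed", str(exc))

    # CUresult cuInit(unsigned int Flags); Flags must be 0.
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    code = libcuda.cuInit(0)
    if code == 0:
        return ("ok", "cuInit succeeded")

    # CUresult cuGetErrorString(CUresult error, const char **pStr)
    libcuda.cuGetErrorString.argtypes = [
        ctypes.c_int,
        ctypes.POINTER(ctypes.c_char_p),
    ]
    libcuda.cuGetErrorString.restype = ctypes.c_int
    message = ctypes.c_char_p()
    libcuda.cuGetErrorString(code, ctypes.byref(message))
    text = message.value.decode() if message.value else "unknown error"
    return ("init_failed", f"CUresult {code}: {text}")


if __name__ == "__main__":
    print(probe_cuda_driver())
```

If this reports init_failed, the problem would be below PyTorch (driver or kernel module), and the CUresult code should be more specific than the PyTorch message. If it reports ok, the PyTorch build or environment would be the more likely suspect.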
Is there a conflict between the installed CUDA version (12.6) and the PyTorch version (2.5.1+cu124)? Why did the CUDA driver initialization fail?