I was trying to run rstar main.py for testing purposes on an HPC cluster and got the error shown below.
I tried some debugging steps suggested by ChatGPT:
(rstar) [ingenx@rdgpu01 ~]$ nvidia-smi
Wed Feb 12 13:35:14 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.06 Driver Version: 545.23.06 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:3B:00.0 Off | On |
| N/A 51C P0 68W / 300W | 0MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe Off | 00000000:5E:00.0 Off | On |
| N/A 56C P0 79W / 300W | 0MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| No MIG devices found |
+---------------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
(rstar) [ingenx@rdgpu01 ~]$ nvcc --version
-bash: nvcc: command not found
(rstar) [ingenx@rdgpu01 ~]$ python3 -c "import torch; print('CUDA Available:', torch.cuda.is_available())"
/home/ingenx/miniconda3/envs/rstar/lib/python3.11/site-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
CUDA Available: False
(rstar) [ingenx@rdgpu01 ~]$ python3 -c "import torch; print('Number of GPUs:', torch.cuda.device_count())"
Number of GPUs: 2
(rstar) [ingenx@rdgpu01 ~]$ python3 -c "import torch; print('GPU Name:', torch.cuda.get_device_name(0))"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/ingenx/miniconda3/envs/rstar/lib/python3.11/site-packages/torch/cuda/__init__.py", line 493, in get_device_name
return get_device_properties(device).name
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ingenx/miniconda3/envs/rstar/lib/python3.11/site-packages/torch/cuda/__init__.py", line 523, in get_device_properties
_lazy_init() # will define _get_device_properties
^^^^^^^^^^^^
File "/home/ingenx/miniconda3/envs/rstar/lib/python3.11/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init
torch._C._cuda_init()
RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.
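For anyone who wants to reproduce the checks above in one go, here is a minimal sketch that gathers them into a single report. It assumes `nvidia-smi` is on the PATH and that PyTorch may or may not be importable. The function names `mig_enabled_without_instances` and `collect_report` are made up for this example. One thing that stands out in the nvidia-smi output is that MIG mode shows "Enabled" on both GPUs while the MIG table says "No MIG devices found", so the helper flags that combination. It is a text heuristic based on the layout printed above, and whether this is the actual cause of the failure is an assumption, not something confirmed.

```python
import shutil
import subprocess


def mig_enabled_without_instances(smi_text: str) -> bool:
    """Heuristic: True if nvidia-smi text shows MIG mode 'Enabled'
    but also reports 'No MIG devices found'.

    Relies on the table layout seen in the output above (an assumption);
    other driver versions may format this differently.
    """
    lines = smi_text.splitlines()
    # The MIG mode cell is the last cell of a GPU row, e.g. "| | | Enabled |".
    mig_on = any(line.rstrip(" |").endswith("Enabled") for line in lines)
    no_instances = "No MIG devices found" in smi_text
    return mig_on and no_instances


def collect_report() -> dict:
    """Collect the same facts as the manual checks into one dict."""
    report = {}

    # nvcc is only the CUDA compiler; its absence is reported, nothing more.
    report["nvcc_on_path"] = shutil.which("nvcc") is not None

    smi = shutil.which("nvidia-smi")
    report["nvidia_smi_on_path"] = smi is not None
    if smi:
        result = subprocess.run([smi], capture_output=True, text=True, check=False)
        report["mig_enabled_without_instances"] = mig_enabled_without_instances(
            result.stdout
        )

    try:
        import torch

        report["torch_version"] = torch.__version__
        # CUDA version PyTorch was built against (None for CPU-only builds).
        report["torch_cuda_build"] = torch.version.cuda
        report["cuda_available"] = torch.cuda.is_available()
        report["device_count"] = torch.cuda.device_count()
    except ImportError:
        report["torch_version"] = None

    return report


if __name__ == "__main__":
    for key, value in collect_report().items():
        print(f"{key}: {value}")
```

Running the script on the login node and again inside a scheduled GPU job would show whether the two environments differ, which is often useful to include when reporting this kind of problem to the cluster admins.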
I was not able to find any solution to this problem on the internet.