RuntimeError: Unexpected error from cudaGetDeviceCount() when I use an A6000

Hey dude, please help me. I have searched all over Google and I can't fix it. Here is my error:

Python 3.10.10 (main, Mar 21 2023, 18:45:11) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch

>>> 
>>> torch.cuda.is_available()
/home/caturbhuja/2T/conda3_envs/tt/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
False
>>> torch.cuda
<module 'torch.cuda' from '/home/caturbhuja/2T/conda3_envs/tt/lib/python3.10/site-packages/torch/cuda/__init__.py'>
>>> torch.cuda.device_count()
1
>>> torch.cuda.current_device()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/caturbhuja/2T/conda3_envs/tt/lib/python3.10/site-packages/torch/cuda/__init__.py", line 674, in current_device
    _lazy_init()
  File "/home/caturbhuja/2T/conda3_envs/tt/lib/python3.10/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal

Here is my torch/CUDA version:

Here is my NVIDIA driver version:

Thu Apr 13 00:27:09 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:B3:00.0 Off |                  Off |
| 30%   30C    P8     9W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Your driver is running into an initialization error, so you might need to update/reinstall your drivers.

I have updated the driver many times and it still doesn't work. Here is my driver version now:

Thu Apr 13 10:12:26 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:B3:00.0 Off |                  Off |
| 30%   60C    P0    67W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

In case this setup was working before I would recommend trying to build and execute any CUDA sample to verify that it’s still working as the errors still point to a setup issue.
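For anyone without the CUDA samples at hand, a rough stand-in for that check is to call the CUDA driver API directly from Python and see whether the failure reproduces outside PyTorch. The sketch below is not from this thread: `probe_cuda_driver` is a hypothetical helper, and it assumes a Linux system where the driver library is named `libcuda.so.1`. It calls the driver functions `cuInit` and `cuDeviceGetCount` through `ctypes`.

```python
import ctypes

# Messages for the CUDA error codes that appear in this thread.
KNOWN_ERRORS = {
    0: "success",
    100: "no CUDA-capable device is detected",
    101: "invalid device ordinal",
    802: "system not yet initialized",
}


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Call cuInit and cuDeviceGetCount directly, bypassing PyTorch.

    Returns (error_code, device_count), or None if the driver library
    cannot be loaded at all. The default library name is an assumption
    about a typical Linux driver install.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError:
        return None  # driver library missing or not on the loader path

    status = libcuda.cuInit(0)
    if status != 0:
        return status, 0  # initialization itself failed

    count = ctypes.c_int(0)
    status = libcuda.cuDeviceGetCount(ctypes.byref(count))
    return status, count.value


if __name__ == "__main__":
    result = probe_cuda_driver()
    if result is None:
        print("could not load the CUDA driver library")
    else:
        code, count = result
        print(f"status {code} ({KNOWN_ERRORS.get(code, 'unknown')}), devices: {count}")
```

If this also reports 101 or 802, the failure sits below PyTorch, which would fit the setup-issue reading above.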

Actually, when I was using the 2080 Ti, torch 2.0 worked fine. But when I switched to the A6000 and installed the relevant drivers, the same code started throwing errors.

Thanks, dude! You know what? The cause was my unpowered 2080 Ti. Out of laziness, I had left it in the computer case without connecting its power. Oh my god! When I took it out… surprise, it worked!

Good to hear you have isolated and fixed the issue. This is indeed a new one, but makes sense assuming the driver tries to initialize the “dead” GPU.
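One way to look for a card in this state is to compare the NVIDIA GPUs listed by `lspci` with the ones `nvidia-smi` reports. This is only a sketch under an assumption the thread does not confirm, namely that an unpowered card still enumerates on the PCI bus. The helper names are hypothetical, and the `17:00.0` address in the usage note is made up for illustration.

```python
import subprocess


def pci_addresses_from_lspci(lspci_text):
    """Return short PCI addresses (e.g. 'b3:00.0') of NVIDIA display devices."""
    addresses = set()
    for line in lspci_text.splitlines():
        if "NVIDIA" not in line:
            continue
        if "VGA compatible controller" not in line and "3D controller" not in line:
            continue  # skip the audio/USB functions that share the card
        addresses.add(line.split()[0].lower())
    return addresses


def pci_addresses_from_smi(smi_text):
    """Normalize nvidia-smi bus IDs ('00000000:B3:00.0') to lspci's short form."""
    addresses = set()
    for line in smi_text.splitlines():
        line = line.strip().lower()
        if ":" in line:
            # Drop the PCI domain prefix, keeping bus:device.function.
            addresses.add(line.split(":", 1)[1])
    return addresses


def unreported_gpus(lspci_text, smi_text):
    """GPUs present on the PCI bus that nvidia-smi does not list."""
    return sorted(pci_addresses_from_lspci(lspci_text) - pci_addresses_from_smi(smi_text))


def run(cmd):
    """Run a command and return its stdout, or '' if the tool is not installed."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout
    except FileNotFoundError:
        return ""


if __name__ == "__main__":
    lspci_out = run(["lspci"])
    smi_out = run(["nvidia-smi", "--query-gpu=pci.bus_id", "--format=csv,noheader"])
    print("on the bus but not reported by nvidia-smi:", unreported_gpus(lspci_out, smi_out))
```

With a hypothetical second card at `17:00.0` and the A6000 at `b3:00.0` from the output above, the script would list `17:00.0` as the address worth inspecting.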

In my case this does not work on Ubuntu.
I have the latest 535 CUDA with the 530 driver.
nvtop works and shows my CPU, and I can play games or run the Heaven test at the same time, but…

python test3.py
__Python VERSION: 3.11.4+ (main, Jun 28 2023, 08:52:25) [GCC 9.4.0]
__pyTorch VERSION: <module 'torch.version' from '/media/jag/NEU/3PAX/ubuntu-webui/env/lib/python3.11/site-packages/torch/version.py'>
__CUDA VERSION
__CUDNN VERSION: 8500
__Number CUDA Devices: 1
__Devices
index, name, driver_version, memory.total [MiB], memory.used [MiB], memory.free [MiB]
0, NVIDIA GeForce GTX 1070, 530.30.02, 8192 MiB, 143 MiB, 7965 MiB
Traceback (most recent call last):
  File "/media/jag/NEU/3PAX/ubuntu-webui/test3.py", line 11, in <module>
    print('Active CUDA Device: GPU', torch.cuda.current_device())
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/jag/NEU/3PAX/ubuntu-webui/env/lib/python3.11/site-packages/torch/cuda/__init__.py", line 674, in current_device
    _lazy_init()
  File "/media/jag/NEU/3PAX/ubuntu-webui/env/lib/python3.11/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized

Hi @RiseInRose, what does non-powered mean?

I ran into this problem too, in a Docker environment. The operating system is:
root@iv-yd1ran0cwam0ad91m1ab:~# cat /proc/version
Linux version 5.4.0-133-generic (buildd@lcy02-amd64-003) (gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)) #149-Ubuntu SMP Mon Nov 14 18:36:06 UTC 2022
The host machine, on 12.3 (12.4), hits the same problem:
root@iv-yd1m36hweem0adi50tnx:~# nvidia-smi
Mon Mar 25 14:44:47 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A800-SXM4-80GB          Off | 00000000:65:01.0 Off |                    0 |
| N/A   30C    P0              58W / 400W |      2MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A800-SXM4-80GB          Off | 00000000:65:02.0 Off |                    0 |
| N/A   28C    P0              60W / 400W |      2MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
The Docker environment is much the same:

a406134a828(@:):/# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0

ca406134a828(@:):/# apt search cudnn
Sorting... Done
Full Text Search... Done
libcudnn9-cuda-12/now 9.0.0.312-1 amd64 [installed,local]
  cuDNN runtime libraries for CUDA 12.3

libcudnn9-dev-cuda-12/now 9.0.0.312-1 amd64 [installed,local]
  cuDNN development headers and symlinks for CUDA 12.3

PyTorch is 2.2.0.

The fix was to align all the versions, as follows:
4e60f119169d(@:):hw2# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
4e60f119169d(@:):hw2# apt search cudnn
Sorting... Done
Full Text Search... Done
libcudnn8/now 8.9.5.29-1+cuda11.8 amd64 [installed,local]
  cuDNN runtime libraries

libcudnn8-dev/now 8.9.5.29-1+cuda11.8 amd64 [installed,local]
  cuDNN development libraries and headers

4e60f119169d(@:):hw2# vim ./test.py
4e60f119169d(@:):hw2# python3 ./test.py
1.11.0
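A small check along these lines can flag one kind of misalignment: a component built for a newer CUDA release than the driver reports in the `nvidia-smi` header. This is a sketch only. The thread does not say which mismatch was actually at fault in this case, and the helper names are hypothetical.

```python
def parse_cuda_version(text):
    """Turn '11.8' or '12.3.107' into (major, minor) for numeric comparison."""
    parts = text.strip().split(".")
    return int(parts[0]), int(parts[1])


def newer_than_driver(driver_cuda, components):
    """Return the names of components whose CUDA version exceeds the driver's."""
    limit = parse_cuda_version(driver_cuda)
    return sorted(
        name
        for name, version in components.items()
        if parse_cuda_version(version) > limit
    )


if __name__ == "__main__":
    # Values copied from the outputs in this post; replace them with your own.
    driver = "12.3"  # "CUDA Version" in the nvidia-smi header on the host
    components = {
        "nvcc": "11.3.109",  # from `nvcc -V` inside the container
    }
    flagged = newer_than_driver(driver, components)
    print(flagged if flagged else "nothing is newer than the driver")
```

For PyTorch itself, `torch.version.cuda` gives the CUDA version the installed build was compiled against, and it can be added to `components` in the same way.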