RuntimeError: Unexpected error from cudaGetDeviceCount() when I use an A6000

Hey dude, please help me. I've searched all over Google and I can't fix it. Here is my error:

Python 3.10.10 (main, Mar 21 2023, 18:45:11) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch

>>> 
>>> torch.cuda.is_available()
/home/caturbhuja/2T/conda3_envs/tt/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
False
>>> torch.cuda
<module 'torch.cuda' from '/home/caturbhuja/2T/conda3_envs/tt/lib/python3.10/site-packages/torch/cuda/__init__.py'>
>>> torch.cuda.device_count()
1
>>> torch.cuda.current_device()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/caturbhuja/2T/conda3_envs/tt/lib/python3.10/site-packages/torch/cuda/__init__.py", line 674, in current_device
    _lazy_init()
  File "/home/caturbhuja/2T/conda3_envs/tt/lib/python3.10/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal

Here is my torch/CUDA version:

Here is my NVIDIA driver version:

Thu Apr 13 00:27:09 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:B3:00.0 Off |                  Off |
| 30%   30C    P8     9W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Your driver is running into an initialization error, so you might need to update/reinstall your drivers.

I've updated the driver many times and it still doesn't work. Here is my driver version now:

Thu Apr 13 10:12:26 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:B3:00.0 Off |                  Off |
| 30%   60C    P0    67W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

If this setup was working before, I would recommend building and running any CUDA sample to verify that it still works, as the errors still point to a setup issue.
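
If building the CUDA samples is inconvenient, a lighter check in the same spirit is to call the CUDA driver library directly from Python, with PyTorch out of the picture. This is only a rough sketch, not the CUDA samples mentioned above: the helper name `cuda_driver_device_count` is made up for illustration, and it assumes a Linux system where the driver library is installed as `libcuda.so.1`. If `cuInit` or `cuDeviceGetCount` fails here too, the problem is likely in the driver/hardware setup rather than in PyTorch.

```python
import ctypes


def cuda_driver_device_count(lib_name="libcuda.so.1"):
    """Return (ok, detail): the device count on success, an error description otherwise.

    Hypothetical helper; lib_name assumes the usual Linux driver library name.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError as exc:
        return False, f"could not load {lib_name}: {exc}"

    # cuInit(0) must succeed before any other driver API call.
    status = libcuda.cuInit(0)
    if status != 0:
        return False, f"cuInit failed with CUDA driver error code {status}"

    # Ask the driver how many devices it can see.
    count = ctypes.c_int(0)
    status = libcuda.cuDeviceGetCount(ctypes.byref(count))
    if status != 0:
        return False, f"cuDeviceGetCount failed with CUDA driver error code {status}"
    return True, count.value


if __name__ == "__main__":
    ok, detail = cuda_driver_device_count()
    print("driver OK, devices:" if ok else "driver problem:", detail)
```

Run it in a fresh interpreter so that no earlier CUDA call can have left an error state behind.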

Actually, when I was using the 2080 Ti, torch 2.0 was working fine. But when I switched to the A6000 and installed the relevant drivers, the same code started throwing errors.

Thanks, dude! You know what? The cause was my unpowered 2080 Ti. Out of laziness, I had left it in the computer case without connecting its power. Oh my god! When I took it out… surprise, it works!!!

Good to hear you have isolated and fixed the issue. This is indeed a new one, but makes sense assuming the driver tries to initialize the “dead” GPU.
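
For anyone landing here with the same error, one way to spot a card that is installed but not usable is to compare what the PCI bus reports with what the driver lists. The sketch below assumes `lspci` and `nvidia-smi -L` are available, and it assumes an unpowered card still shows up in `lspci`, which was not confirmed in this thread. The helper names are made up for illustration.

```python
import subprocess


def nvidia_pci_devices(lspci_output):
    """Return lspci lines that describe NVIDIA display or 3D controllers."""
    wanted = ("VGA compatible controller", "3D controller")
    return [
        line.strip()
        for line in lspci_output.splitlines()
        if "NVIDIA" in line and any(kind in line for kind in wanted)
    ]


def driver_gpus(nvidia_smi_list_output):
    """Return the 'GPU N: ...' lines printed by `nvidia-smi -L`."""
    return [
        line.strip()
        for line in nvidia_smi_list_output.splitlines()
        if line.startswith("GPU ")
    ]


def run(cmd):
    """Run a command and return its stdout, or '' if it is missing or fails."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return ""


if __name__ == "__main__":
    on_bus = nvidia_pci_devices(run(["lspci"]))
    in_driver = driver_gpus(run(["nvidia-smi", "-L"]))
    print(f"{len(on_bus)} NVIDIA GPU(s) on the PCI bus, {len(in_driver)} listed by the driver")
    if len(on_bus) != len(in_driver):
        print("Mismatch: a card may be installed but not usable (e.g. unpowered).")
```

A mismatch is only a hint, not proof; physically checking power connectors, as was done above, is what settled it in this case.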