Hey, please help me. I have searched all over Google and I can't fix it. Here is my error:
Python 3.10.10 (main, Mar 21 2023, 18:45:11) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>>
>>> torch.cuda.is_available()
/home/caturbhuja/2T/conda3_envs/tt/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
False
>>> torch.cuda
<module 'torch.cuda' from '/home/caturbhuja/2T/conda3_envs/tt/lib/python3.10/site-packages/torch/cuda/__init__.py'>
>>> torch.cuda.device_count()
1
>>> torch.cuda.current_device()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/caturbhuja/2T/conda3_envs/tt/lib/python3.10/site-packages/torch/cuda/__init__.py", line 674, in current_device
    _lazy_init()
  File "/home/caturbhuja/2T/conda3_envs/tt/lib/python3.10/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal
I have updated the driver many times and it still does not work. Here is my current driver version:
Thu Apr 13 10:12:26 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:B3:00.0 Off |                  Off |
| 30%   60C    P0    67W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
If this setup was working before, I would recommend building and running any CUDA sample to verify that it still works, as the errors point to a setup issue.
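Short of building a CUDA sample, a quicker first check is to call the CUDA driver API directly, outside PyTorch. The following is a minimal sketch, assuming a Linux machine where the driver library is installed as `libcuda.so.1`. `probe_cuda_driver` is a hypothetical helper name; `cuInit` and `cuDeviceGetCount` are the driver API entry points it calls.

```python
import ctypes


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Return (status, device_count), or None if the driver library cannot be loaded.

    A status of 0 means CUDA_SUCCESS. device_count is None when a call fails.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError:
        return None  # driver library not found on this machine

    status = libcuda.cuInit(0)  # the flags argument must be 0
    if status != 0:
        return status, None

    count = ctypes.c_int(0)
    status = libcuda.cuDeviceGetCount(ctypes.byref(count))
    return status, (count.value if status == 0 else None)


if __name__ == "__main__":
    print(probe_cuda_driver())
```

If this already returns a non-zero status, the problem most likely sits in the driver or hardware setup rather than in PyTorch.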
Actually, when I was using the 2080 Ti, torch 2.0 was working fine. But after I switched to the A6000 and installed the relevant drivers, the same code started throwing errors.
Thanks, dude! You know what? The cause was my unpowered 2080 Ti. Out of laziness, I had left it in the computer case without connecting power to it. Oh my god! When I took it out... surprise, the error was gone!
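A leftover card like this can be spotted by comparing what the PCI bus lists with what the driver lists. The following is a rough sketch, assuming `lspci` and `nvidia-smi -L` are available on the PATH. The helper names are made up for illustration, and whether an unpowered card shows up in `lspci` at all depends on the hardware.

```python
import re
import subprocess


def count_pci_nvidia_gpus(lspci_text):
    """Count NVIDIA display-class devices in `lspci` output."""
    pattern = re.compile(r"(VGA compatible controller|3D controller).*NVIDIA", re.IGNORECASE)
    return sum(1 for line in lspci_text.splitlines() if pattern.search(line))


def count_driver_gpus(smi_list_text):
    """Count the 'GPU <n>: ...' lines printed by `nvidia-smi -L`."""
    return sum(1 for line in smi_list_text.splitlines() if re.match(r"GPU \d+:", line.strip()))


def run(cmd):
    """Run a command and return its stdout, or '' if it is missing or fails."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return ""


if __name__ == "__main__":
    on_bus = count_pci_nvidia_gpus(run(["lspci"]))
    in_driver = count_driver_gpus(run(["nvidia-smi", "-L"]))
    print(f"PCI bus: {on_bus} NVIDIA GPU(s), driver: {in_driver}")
    if on_bus != in_driver:
        print("Mismatch: a card may be unpowered, faulty, or unsupported by this driver.")
```

A mismatch is only a hint, not proof, but it is a cheap thing to check before reinstalling drivers again.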
In my case this does not work on Ubuntu.
I've got the latest 535 CUDA with the 530 driver.
nvtop works and shows my CPU, and I can play games or run the Heaven benchmark at the same time, but...
python test3.py
__Python VERSION: 3.11.4+ (main, Jun 28 2023, 08:52:25) [GCC 9.4.0]
__pyTorch VERSION: <module 'torch.version' from '/media/jag/NEU/3PAX/ubuntu-webui/env/lib/python3.11/site-packages/torch/version.py'>
__CUDA VERSION
__CUDNN VERSION: 8500
__Number CUDA Devices: 1
__Devices
index, name, driver_version, memory.total [MiB], memory.used [MiB], memory.free [MiB]
0, NVIDIA GeForce GTX 1070, 530.30.02, 8192 MiB, 143 MiB, 7965 MiB
Traceback (most recent call last):
  File "/media/jag/NEU/3PAX/ubuntu-webui/test3.py", line 11, in <module>
    print('Active CUDA Device: GPU', torch.cuda.current_device())
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/jag/NEU/3PAX/ubuntu-webui/env/lib/python3.11/site-packages/torch/cuda/__init__.py", line 674, in current_device
    _lazy_init()
  File "/media/jag/NEU/3PAX/ubuntu-webui/env/lib/python3.11/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized
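Both failures in this thread come through the same cudaGetDeviceCount() message and differ only in the trailing error code (101 earlier, 802 here). The sketch below pulls that code out of the message and maps it to the matching CUDA runtime enum name. The table covers only a few codes, and `describe_cuda_init_error` is a hypothetical helper, not part of PyTorch.

```python
import re

# Partial table of CUDA runtime error codes (cudaError_t); not exhaustive.
CUDA_ERROR_NAMES = {
    100: "cudaErrorNoDevice",
    101: "cudaErrorInvalidDevice",
    802: "cudaErrorSystemNotReady",
    803: "cudaErrorSystemDriverMismatch",
}


def describe_cuda_init_error(message):
    """Extract 'Error <code>: <text>' from a PyTorch CUDA init message.

    Returns (code, enum_name, text), or None if no error code is found.
    enum_name is 'unknown' for codes missing from the partial table above.
    """
    match = re.search(r"Error (\d+): (.*?)(?:\s*\(Triggered|$)", message)
    if match is None:
        return None
    code = int(match.group(1))
    return code, CUDA_ERROR_NAMES.get(code, "unknown"), match.group(2).strip()


if __name__ == "__main__":
    msg = (
        "Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions "
        "before calling NumCudaDevices() that might have already set an error? "
        "Error 802: system not yet initialized"
    )
    print(describe_cuda_init_error(msg))
```

Knowing the enum name makes it easier to look the error up in the CUDA documentation than searching for the free-text message.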
I ran into this problem too, in a Docker environment. The operating system is:
root@iv-yd1ran0cwam0ad91m1ab:~# cat /proc/version
Linux version 5.4.0-133-generic (buildd@lcy02-amd64-003) (gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)) #149-Ubuntu SMP Mon Nov 14 18:36:06 UTC 2022
The host machine is on 12.3 (12.4) and runs into the same problem:
root@iv-yd1m36hweem0adi50tnx:~# nvidia-smi
Mon Mar 25 14:44:47 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A800-SXM4-80GB          Off | 00000000:65:01.0 Off |                    0 |
| N/A   30C    P0              58W / 400W |      2MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A800-SXM4-80GB          Off | 00000000:65:02.0 Off |                    0 |
| N/A   28C    P0              60W / 400W |      2MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
The Docker environment is much the same:
ca406134a828(@:):/# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
ca406134a828(@:):/# apt search cudnn
Sorting... Done
Full Text Search... Done
libcudnn9-cuda-12/now 9.0.0.312-1 amd64 [installed,local]
cuDNN runtime libraries for CUDA 12.3
libcudnn9-dev-cuda-12/now 9.0.0.312-1 amd64 [installed,local]
cuDNN development headers and symlinks for CUDA 12.3
PyTorch is 2.2.0.
The fix was to align all the versions, as follows:
4e60f119169d(@:):hw2# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
4e60f119169d(@:):hw2# apt search cudnn
Sorting... Done
Full Text Search... Done
libcudnn8/now 8.9.5.29-1+cuda11.8 amd64 [installed,local]
cuDNN runtime libraries
libcudnn8-dev/now 8.9.5.29-1+cuda11.8 amd64 [installed,local]
cuDNN development libraries and headers
4e60f119169d(@:):hw2# vim ./test.py
4e60f119169d(@:):hw2# python3 ./test.py
1.11.0
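To see at a glance whether versions like these line up, the release numbers can be pulled out of the tool output and compared with what PyTorch reports in `torch.version.cuda`. The following is a minimal sketch: the helper names are hypothetical, and comparing only the major version is an assumption about what "aligned" should mean here.

```python
import re
import subprocess


def nvcc_release(nvcc_output):
    """Return the toolkit release (e.g. '11.3') from `nvcc -V` output, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def smi_cuda_version(smi_output):
    """Return the 'CUDA Version' shown in the nvidia-smi header, or None."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", smi_output)
    return match.group(1) if match else None


def same_major(a, b):
    """True when two 'major.minor' version strings share the major number."""
    return a is not None and b is not None and a.split(".")[0] == b.split(".")[0]


if __name__ == "__main__":
    try:
        out = subprocess.run(["nvcc", "-V"], capture_output=True, text=True).stdout
    except OSError:
        out = ""  # nvcc is not installed or not on the PATH
    toolkit = nvcc_release(out)
    try:
        import torch
        built_with = torch.version.cuda  # None for CPU-only builds
    except ImportError:
        built_with = None
    print("nvcc:", toolkit, "| torch built with CUDA:", built_with,
          "| same major:", same_major(toolkit, built_with))
```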