CUDA initialization: Unexpected error from cudaGetDeviceCount(); torch.cuda.is_available() returns False with CUDA 11.6 (or CUDA 10.2). How do I solve it?

I created multiple environments based on CUDA 10.2 and CUDA 11.6, and both report the same error on an A100 card: CUDA initialization: Unexpected error from cudaGetDeviceCount(). How do I solve it?
nvidia-smi reports CUDA 11.6 on an A100.
1. With CUDA 10.2 exported (nvcc -V reports 10.2):
conda create -n XXX python=3.9
conda activate XXX
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=10.2 -c pytorch
python
import torch
torch.cuda.is_available()
returns False
2. With CUDA 11.6 exported (nvcc -V reports 11.6):
conda create -n XXX python=3.9
conda activate XXX
conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge
(or) pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
python
import torch
torch.cuda.is_available()
returns False

The following warning is raised:
UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at /opt/conda/conda-bld/pytorch_1639180549130/work/c10/cuda/CUDAFunctions.cpp:112.)

Versions:
e.g. nvcc = 11.6, nvidia-smi = 11.6, PyTorch = 1.12
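
For reference, here is a quick way to confirm which PyTorch binary ended up in the environment and whether the driver responds at all (a minimal check run inside the activated conda environment; it only reproduces the same cudaGetDeviceCount() path, it does not fix anything):
nvidia-smi                                                               # does the driver itself respond?
python -c "import torch; print(torch.__version__, torch.version.cuda)"   # installed binary and its CUDA runtime
python -c "import torch; print(torch.cuda.device_count())"               # hits the same cudaGetDeviceCount() call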

It seems your driver has trouble initializing the device:

UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized

Was this setup working before at all? If so, what changed that might have caused the issue (e.g. did you update the drivers)?

This is a newly launched server. When I tried to use the GPU, the above problem occurred.

OK, I see. In that case you could check whether any other CUDA sample runs (I would expect to see the same or similar errors), and you might need to reinstall the drivers.
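
For example, building and running the deviceQuery sample from NVIDIA's cuda-samples repository tests the driver independently of PyTorch (a sketch; the path to the sample differs between samples releases, and a CUDA toolkit is needed to build it):
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/1_Utilities/deviceQuery    # older releases keep it under Samples/deviceQuery
make
./deviceQuery                                      # on an affected machine this should fail with the same error 802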

Any news here? We are having exactly the same problem: the A40s work fine but the A100s do not, even though the configuration of the A40 and A100 nodes is identical. I tested with different PyTorch versions (1.12, 1.11, 1.10).

Same error here. Have you solved this problem?

No updates; @LUO77123 didn’t follow up, and the error is usually caused by a setup issue.
Check whether any CUDA application can run and make sure the right PyTorch binary with the desired CUDA runtime is installed.

This did solve the issue for me (nvidia-fabricmanager was missing): Notes on using NVIDIA A100 (40GB) - Deep Learning - fast.ai Course Forums
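
In other words, on A100 SXM systems (HGX/DGX) the NVSwitch fabric has to be brought up by the nvidia-fabricmanager service before CUDA can enumerate the devices, otherwise cudaGetDeviceCount() returns error 802. A rough sketch of the fix, assuming Ubuntu and the 515 driver branch (match the package name to your installed driver version):
sudo apt-get install -y nvidia-fabricmanager-515   # package name follows the driver branch
sudo systemctl enable --now nvidia-fabricmanager   # start the service now and at every boot
systemctl status nvidia-fabricmanager              # should report active (running)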


A reboot worked for me.

I had a problem with a 4xA100 box where both the host CUDA samples and the self-contained PyTorch binaries inside Docker returned cudaGetDeviceCount errors. I just figured out the root cause: the A100 has MIG mode enabled by default, which makes cudaGetDeviceCount() error out.

I solved it by running sudo nvidia-smi -mig 0 to disable MIG mode, rebooting since the change was left pending (perhaps due to loginctl linger settings), and then running sudo nvidia-smi -mig 0 again.
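
For anyone checking their own box first: the current and pending MIG state can be queried without a reboot (a quick check; these query fields require a reasonably recent nvidia-smi, and the per-GPU variant is optional):
nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending --format=csv
sudo nvidia-smi -i 0 -mig 0   # disable MIG on GPU 0 only, if you don't want to change all GPUs at once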

I have had this problem for 3 months with no solution (error 802 in the CUDA samples), on CUDA 11.8 with the NVIDIA drivers and toolkit installed.
Fabric Manager did not help either.