RuntimeError: Unexpected error from cudaGetDeviceCount()

I was training a GCN model on my Linux server when I suddenly got this error.

RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW

  • PyTorch version: 1.10.1+cu102
  • OS: Linux
  • Python version: Python 3.8.10
  • CUDA Version: 11.2

Is nvidia-smi returning any errors or complaining about a driver mismatch? If so, could you restart the server and check if that helps? If not, did you recently update any drivers, or are you manually trying to get forward compatibility working on non-server GPUs?
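In case it's easier to compare things side by side, something like this shows the kernel driver, the user-mode driver reported by nvidia-smi, and the CUDA runtime PyTorch ships with (just a sketch; adapt to your environment):

nvidia-smi                                             # user-mode driver and the highest CUDA version it supports
cat /proc/driver/nvidia/version                        # kernel-mode driver actually loaded
python -c "import torch; print(torch.version.cuda)"    # CUDA runtime bundled with the PyTorch binaries

If the first two disagree, the driver/library mismatch usually goes away after a reboot (or after reloading the NVIDIA kernel modules).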

No, it doesn’t return any errors:

NVIDIA-SMI 450.57, Driver Version: 450.57, CUDA Version: 11.2

I have restarted it many times but still the same problem.

I didn’t do any updates. I installed PyTorch and the installation completed successfully.


Sir, by doing:

!python -c "import torch; print(torch.cuda.is_available())"

I got:

/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:80: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:112.)
  return torch._C._cuda_getDeviceCount() > 0
False

Based on this issue other users were running into the same error message if

  • their setup was broken due to a driver/library mismatch (rebooting seemed to solve the issue)
  • their installed drivers didn’t match the user-mode driver inside a docker container (and forward compatibility failed due to the usage of non-server GPUs; see the quick check below)

Was your setup working before and if so, what changed?
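For the second case, the forward-compatibility libcuda from the cuda-compat packages typically lives under /usr/local/cuda*/compat, and error 804 is exactly what you get when it is used on a GPU that doesn't support forward compatibility. A quick check for which libcuda is visible (just a sketch; the paths can differ on your system):

ldconfig -p | grep libcuda                 # which libcuda.so entries the dynamic linker knows about
ls /usr/local/cuda*/compat 2>/dev/null     # is a forward-compatibility libcuda installed at all?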


Thank you Sir :). My problem is solved.
By doing:

!pip3 install torch==1.10.1+cu113 torchvision==0.11.2+cu113 torchaudio==0.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
!pip3 install torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
!pip3 install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
!pip3 install torch-cluster -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
!pip3 install torch-geometric
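For anyone else landing here, a quick sanity check that the new wheels are actually picked up could be something like:

!python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# expected for this install: 1.10.1+cu113  11.3  True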


Hi ptrblck,

Thanks for your valuable comments.

I might have a very similar issue, shown below. I’m implementing DDP training on HPCs (with SLURM or LSF), where each node has 4 V100 GPUs. Without any changes to my code, envs, etc. (pytorch=1.9.0, cuda 10.2), I have randomly run into this issue very recently. As a result, I found torch.cuda.is_available() returns False on a few nodes while CUDA is available on most nodes. It’s hard to reboot the cluster; do you have any suggestions for further debugging?

[UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at …/c10/cuda/CUDAFunctions.cpp:115.)
return torch._C._cuda_getDeviceCount() > 0]

Do you receive this message randomly?
If so, I would guess your system ran into some kind of issue and might have dropped the GPU.
Based on the error message I would have guessed you are running into a setup issue, but that wouldn’t explain the randomness (assuming you are indeed seeing these errors randomly). Are you seeing any Xids in dmesg?
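Something along these lines pulls the Xid reports out of the kernel log (a sketch; sudo may not be needed on every system):

sudo dmesg -T | grep -i xid        # Xid events with human-readable timestamps
sudo journalctl -k | grep -i xid   # same via the journal; add -b -1 to look at the previous boot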

Hi ptrblck,

Thanks for the quick reply. I think you are right about the "randomness", and sorry for the confusion. I think the "randomness" comes from the randomly assigned computing nodes each time a job is submitted. If it were a setup issue of PyTorch, CUDA, and the driver, how could most nodes work well while a few fail, assuming each computing node is set up identically?

By running dmesg, I do find some Xids, examples below. But I don’t think these Xids make torch.cuda unavailable, since I can successfully perform DDP training with these Xids present.


The Xids show that you are running into a page fault, which is usually an illegal memory access caused somewhere in the software stack.

Does this mean that the same node causes the issue if it’s selected in your env?
If so, check the node’s health status as it seems it has trouble with the driver.
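If it helps to pin the failures to specific hosts, a rough per-node check under SLURM could look like this (node/task counts are placeholders; under LSF the equivalent would go through bsub/jsrun):

srun --nodes=4 --ntasks-per-node=1 \
    python -c "import socket, torch; print(socket.gethostname(), torch.cuda.is_available())"

Any host that prints False would be a candidate for draining and rebooting.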

Hi ptrblck,

Thanks for your suggestions. A little update here: it was indeed the node’s health status. After rebooting the nodes, the problem was solved (at least partially).

Just to confirm, same here. I randomly ran into this error. In my last run the code was running properly; suddenly it started producing this error. I tried rebooting, but it didn’t work. I checked the CUDA runtimes, all good.
I would really appreciate some help.

Hi @ptrblck
I wanted to build a Docker image to run a repository which specifically only works with PyTorch 1.9.0+cu111.
I had everything working on my local PC (which is also the Docker host), which was built with:

  • Ubuntu 22.04
  • NVIDIA-SMI 470.239.06
  • Driver Version: 470.239.06
  • CUDA Version: 11.4
  • torch 1.9.0+cu111

When I try to replicate this on a Docker image built upon nvidia/cuda:11.7.1-cudnn8-devel-ubuntu22.04, which has the versions below:

  • Ubuntu 22.04
  • NVIDIA-SMI 470.239.06
  • Driver Version: 470.239.06
  • CUDA Version: 11.7
  • nvcc 11.7
  • torch 1.9.0+cu111

and then torch.cuda.is_available() gives me this error.

UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:115.)
return torch._C._cuda_getDeviceCount() > 0

False

Please suggest what can be done in this case! This combination works on the local PC but not in Docker.

PS:
If you suggest using nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04 to match the CUDA version: that gives me the error below, which is what made me upgrade to 22.04/11.7.

ImportError: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /opt/conda/lib/python3.8/site-packages/mmcv/_ext.cpython-38-x86_64-linux-gnu.so)

I don’t know the differences between your local run and the docker one, but the same error was fixed here by reinstalling the NVIDIA driver.
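If reinstalling doesn't change anything, my guess would be that the CUDA 11.7 image is pulling in its bundled forward-compatibility libcuda on top of your 470 driver, and forward compatibility then fails with error 804 because it is only supported on server GPUs. A rough way to check what the container actually sees (a sketch; assumes the nvidia-container-toolkit is set up on the host):

docker run --rm --gpus all nvidia/cuda:11.7.1-cudnn8-devel-ubuntu22.04 nvidia-smi
# does the container see the 470 host driver at all?
docker run --rm --gpus all nvidia/cuda:11.7.1-cudnn8-devel-ubuntu22.04 sh -c 'ldconfig -p | grep libcuda'
# which libcuda the dynamic linker resolves inside the container; a compat libcuda here points to forward compatibility being used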

What solved my issue was installing and starting the fabric manager:

sudo apt install nvidia-fabricmanager-535 libnvidia-nscq-535
sudo systemctl start nvidia-fabricmanager # this was likely the missing piece

I’ve got an 8-GPU server and needed to start this service so the GPUs could talk to each other.

(I also found https://ubuntu.com/server/docs/nvidia-drivers-installation helpful for installing only the drivers, and the CUDA Installation Guide for Linux for removing previous installs.)
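If it's useful to anyone, the service can be checked and made persistent across reboots with something like:

sudo systemctl status nvidia-fabricmanager   # should report active (running)
sudo systemctl enable nvidia-fabricmanager   # start it automatically on boot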


I am getting the same issue. nvidia-smi is working. I have set up CUDA in WSL Ubuntu 22.04, and I have manually updated the CUDA drivers on Windows. Can you help?