PyTorch seems to have suddenly stopped using the GPU even though it worked perfectly fine before an Ubuntu reboot

SpaceExp_NN · April 20, 2021, 11:18pm

I am running RL training sessions using PyTorch, and I have been able to run all of them on my GPU for the past month without any issues. For some unknown reason my computer rebooted today, and ever since then I have not been able to run my training code on the GPU. I keep getting an error that says “RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable”. I found a similar post on this forum that suggested rebooting the computer, but that hasn’t worked either. I even purged and reinstalled all my NVIDIA drivers, and they are recognized from the terminal. I have attached a few software-related specifics to this message. Any advice on how to resolve this issue would be greatly appreciated!

$ nvcc --version
Cuda compilation tools, release 9.1, V9.1.85

$ conda list
cudatoolkit 10.1.243
pytorch 1.8.1
torch 1.7.1 pypi_0

$ nvidia-smi
NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2

>>> torch.cuda.is_available()
False

>>> torch.cuda.device_count()
0

ptrblck · April 21, 2021, 10:45pm

This issue can occur, for example, if your system automatically updated some drivers and the update failed.
Based on your information, it seems that you have a local CUDA 9.1 toolkit installation as well as a new 460 driver. Did you update this driver manually, or could it have been updated by the system?
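
To make the driver, toolkit, and PyTorch views of CUDA easier to compare in one place, a small script along these lines could collect them. This is only a sketch: the helper names `run_tool` and `collect_cuda_report` are made up for illustration, and it assumes `nvidia-smi` and `nvcc` are on the PATH if they are installed (missing tools are reported as `None`).

```python
import shutil
import subprocess


def run_tool(cmd):
    """Return the tool's stdout, or None if the tool is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def collect_cuda_report():
    """Gather what the driver, the local toolkit, and PyTorch each report."""
    report = {
        "nvidia_smi": run_tool(["nvidia-smi"]),      # driver-side view
        "nvcc": run_tool(["nvcc", "--version"]),     # local toolkit compiler
        "torch_version": None,
        "torch_file": None,
        "torch_built_with_cuda": None,
        "cuda_available": None,
        "device_count": None,
    }
    try:
        import torch
    except (ImportError, OSError):
        return report  # PyTorch is not importable in this environment

    report["torch_version"] = torch.__version__
    report["torch_file"] = torch.__file__  # shows which install is picked up
    report["torch_built_with_cuda"] = torch.version.cuda  # None on CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in collect_cuda_report().items():
        print(f"{key}: {value}")
```

Running it inside the conda environment used for training would also show which of the two listed installs (`pytorch 1.8.1` from conda or `torch 1.7.1` from PyPI) Python actually imports, via `torch_version` and `torch_file`.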

SpaceExp_NN · April 23, 2021, 4:59pm

In my conda environment, I updated the CUDA toolkit manually because I read elsewhere that cudatoolkit 10.1 would resolve the issue (it didn’t). The 460 driver was installed after I purged all the NVIDIA drivers and reinstalled them. I am not sure which cudatoolkit I had before I changed it to 10.1, but I am certain that I had 460 in there even before I purged it.
EDIT: I just checked my nvidia-smi and it says that I have CUDA version 11.2, but my nvcc --version says that I have “Cuda compilation tools, release 9.1, V9.1.85”. Is this normal, or could this be the source of my error?
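
For context on the two numbers in the EDIT: `nvidia-smi` prints the highest CUDA version the installed driver supports, while `nvcc --version` prints the version of the locally installed toolkit compiler, so the two can legitimately differ. The sketch below only parses the two lines quoted in this thread and compares them; the function names are hypothetical, and the comparison by itself does not show whether this is the cause of the error.

```python
import re


def parse_nvcc_release(text):
    """Extract (major, minor) from an `nvcc --version` line, or None."""
    match = re.search(r"release\s+(\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def parse_smi_cuda_version(text):
    """Extract (major, minor) from the `nvidia-smi` header line, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


# The two lines quoted earlier in this thread.
nvcc_line = "Cuda compilation tools, release 9.1, V9.1.85"
smi_line = "NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2"

toolkit = parse_nvcc_release(nvcc_line)         # local toolkit version
driver_max = parse_smi_cuda_version(smi_line)   # highest version the driver supports

print("local toolkit:", toolkit)
print("driver supports up to:", driver_max)
print("toolkit not newer than driver limit:", toolkit <= driver_max)
```

Here the local toolkit (9.1) is older than the driver's limit (11.2), which is the usual direction for such a difference.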