CUDA fails to reinitialize after system suspend

Hello,

I am running training on a machine with an RTX 3080, Ubuntu 22.04 LTS, and CUDA 11.6, in a Jupyter notebook inside an Anaconda environment.

My training data is stored in an external MongoDB database. When initializing training, I load a large chunk of this dataset into memory on my machine, which takes about 20 minutes each time. After putting my machine into sleep mode (“suspend”) and waking it, trying to resume training produces the error message below, and the CUDA device is no longer recognized:

/home/user/anaconda3/envs/pyt116/lib/python3.10/site-packages/torch/cuda/__init__.py:83:
UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up 
environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the
available devices to be zero. (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0

It seems to be related to a change of the CUDA_VISIBLE_DEVICES environment variable. The error forces me to reboot the machine and reload all the data each time, which is a cumbersome process. Can anybody give advice on how to make the system recognize the CUDA device again after waking the PC?
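
For reference, this is how I check whether CUDA is back after waking, from a fresh interpreter (the failed initialization seems to persist for the lifetime of the process):

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"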

Thanks!

Best, JZ

It’s a known issue with Linux’s suspend mode and the CUDA driver.
Resetting the device as described here might work.
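
In case the link is unavailable: the reset described there boils down to unloading and reloading the nvidia_uvm kernel module after stopping everything that uses the GPU, i.e. something like:

sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm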

Hello @ptrblck,

unfortunately, this doesn’t solve the issue. The first command fails with
sudo rmmod nvidia_uvm
>> ERROR: module nvidia_uvm is in use
and the second command then produces no output at all.

Maybe it’s because I am using a graphical interface on my PC? Or is the NVIDIA module somehow still tied to the Jupyter notebook session?

EDIT: I see in your post that it can be related to IDEs. Is there another solution by now, or is the situation still the same?

When I run into this issue, I need to kill all Python processes (the simplest way is to close the IDE) and check nvidia-smi for other processes holding the GPU (e.g. Spotify sometimes also grabs the GPU and blocks the device reset), which I then close as well. Afterwards the reset usually works for me.
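
If nvidia-smi doesn’t show the culprit, listing the processes holding the NVIDIA device nodes directly can also help (assuming the fuser tool from psmisc is installed):

sudo fuser -v /dev/nvidia*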

OK, so what you are suggesting does reset the device, but there is no way around closing all IDEs / Python processes. That means I’d still have to reload my data every time, so it doesn’t solve my problem; it seems there is simply no workaround.

Any update on this problem? Is it Ubuntu-specific, or does it persist in other Linux distros?

It’s quite a major hindrance if you are used to a workflow where you do a training run and then need to investigate / extract results from different variables, basically continuing to work in a single process for a couple of hours.

I don’t understand why your use case would require suspending the machine.
As far as I know it’s a known Linux issue, but I’m unsure which distros are affected.

Same issue here on Arch.

The rmmod man page says there’s a -f/--force flag for the
ERROR: module nvidia_uvm is in use
error. It’s not recommended, but I tried it anyway; it didn’t solve the issue.
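
Concretely, that meant something like the following (not recommended, since force-unloading a module that is in use can destabilize the kernel):

sudo rmmod -f nvidia_uvm
sudo modprobe nvidia_uvm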