CUDA fails to reinitialize after system suspend

Hello,

I am running training on a machine with an RTX 3080, Ubuntu 22.04 LTS, and CUDA 11.6, in a Jupyter notebook inside an Anaconda environment.

My training data is stored in an external MongoDB database. When initializing training, I load a large chunk of this dataset into memory, which takes about 20 minutes each time. After I put the machine into sleep mode (“suspend”) and wake it again, resuming training fails with this error message and the CUDA device is no longer recognized:

/home/user/anaconda3/envs/pyt116/lib/python3.10/site-packages/torch/cuda/__init__.py:83:
UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up 
environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the
available devices to be zero. (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0

The warning points to a change of the CUDA_VISIBLE_DEVICES environment variable as one possible cause. As a result, I have to reboot the machine and reload all the data each time, which is cumbersome. Can anybody advise how to make the system recognize the CUDA device again after waking the PC?
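To narrow down whether only the running notebook kernel has lost the device or whether newly started processes are affected too, CUDA availability can be probed from a fresh Python interpreter. The following is a minimal sketch, not a fix: the helper name `cuda_visible_in_fresh_process` is hypothetical, and the default probe assumes `torch` is importable by the interpreter passed in `python`.

```python
import subprocess
import sys

# Default probe: assumes torch is installed for the chosen interpreter.
DEFAULT_PROBE = "import torch; print(torch.cuda.is_available())"


def cuda_visible_in_fresh_process(code=DEFAULT_PROBE, python=sys.executable, timeout=120):
    """Run `code` in a new interpreter and report whether it printed 'True'.

    Hypothetical helper for illustration. The current process is not
    touched, so data already loaded in a notebook stays in memory.
    """
    result = subprocess.run(
        [python, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # Warnings such as "CUDA unknown error" go to stderr; only stdout is parsed.
    lines = result.stdout.strip().splitlines()
    return result.returncode == 0 and bool(lines) and lines[-1].strip() == "True"
```

If this returns `False` after waking the machine, a new process cannot see the GPU either, which would suggest the problem sits at the driver level rather than in the notebook session alone.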

Thanks!

Best, JZ

It’s a known issue with Linux suspend mode and the CUDA driver.
Resetting the device as described here might work.

Hello @ptrblck,

unfortunately, this doesn’t solve the issue.
sudo rmmod nvidia_uvm
>> ERROR: module nvidia_uvm is in use
The second command then produces no output.

Maybe it’s because I am using a graphical interface on my PC? Or is the nvidia module somehow still held by the Jupyter notebook session?

EDIT: I see in your post that it can be related to IDEs. Is there another solution by now, or is it still as described there?

When I run into this issue, I need to kill all Python processes (the simplest way is to close the IDE) and check nvidia-smi for other processes that I can close (e.g. Spotify sometimes also grabs the GPU and blocks the device reset). Afterwards, the reset usually works for me.
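The check for processes that still hold the GPU can be scripted. Below is a minimal sketch that calls `nvidia-smi` with its `--query-compute-apps` option and parses the CSV output; the function names are hypothetical. Note that this query only reports compute processes, so graphics clients (a desktop session, or an app like Spotify) may only show up in the plain `nvidia-smi` table, and `nvidia-smi` itself may fail if the driver is in a bad state after resume.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name' CSV lines into a list of (pid, name) tuples."""
    apps = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        pid, _, name = line.partition(",")
        # Skip anything that does not start with a numeric PID.
        if pid.strip().isdigit():
            apps.append((int(pid.strip()), name.strip()))
    return apps


def list_gpu_compute_processes():
    """Ask nvidia-smi which compute processes currently use the GPU."""
    output = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,process_name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_compute_apps(output)
```

Calling `list_gpu_compute_processes()` returns the PIDs to close before retrying the module reset; as long as the list is not empty, `rmmod nvidia_uvm` is likely to keep reporting that the module is in use.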

Yes, ok, so basically what you are suggesting does solve the device problem, but there is no way around closing all IDEs and Python processes. That means I’d have to reload my data every time. So it doesn’t solve my problem, but it seems there is no workaround anyway.