CUDA fails to reinitialize after system suspend

Hello,

I am running training on a machine with an RTX 3080, Ubuntu 22.04 LTS, and CUDA 11.6, in a Jupyter notebook inside an Anaconda environment.

My training data is stored in an external MongoDB database. When initializing training, I load a large chunk of this dataset into memory on my machine, which takes about 20 minutes each time. After putting my machine into sleep mode (“suspend”) and waking it, trying to resume training produces the error message below, and the CUDA device is no longer recognized:

/home/user/anaconda3/envs/pyt116/lib/python3.10/site-packages/torch/cuda/__init__.py:83:
UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up 
environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the
available devices to be zero. (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0

It seems to be related to a change of the CUDA_VISIBLE_DEVICES environment variable. The error forces me to reboot the machine and reload all the data each time, which is a cumbersome process. Can anybody give advice on how to make the system recognize the CUDA device again after waking the PC?
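
For reference, this is how I check whether CUDA is back after waking, from a fresh interpreter (the failed initialization seems to persist for the lifetime of the process):

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"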

Thanks!

Best, JZ

It’s a known issue with Linux’s suspend mode and the CUDA driver.
Resetting the device as described here might work.
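
In case the link is unavailable: the reset described there boils down to unloading and reloading the nvidia_uvm kernel module after stopping everything that uses the GPU, i.e. something like:

sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm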

Hello @ptrblck,

unfortunately, this doesn’t solve the issue. The first command fails with
sudo rmmod nvidia_uvm
>> ERROR: module nvidia_uvm is in use
and the second command then produces no output at all.

Maybe it’s because I am using a graphical interface on my PC? Or is the NVIDIA module somehow still tied to the Jupyter notebook session?

EDIT: I see in your post that it can be related to IDEs. Is there another solution by now, or is the situation still the same?

When I run into this issue, I need to kill all Python processes (the simplest way is to close the IDE) and check nvidia-smi for other processes holding the GPU (e.g. Spotify sometimes also grabs the GPU and blocks the device reset), which I then close as well. Afterwards the reset usually works for me.
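
If nvidia-smi doesn’t show the culprit, listing the processes holding the NVIDIA device nodes directly can also help (assuming the fuser tool from psmisc is installed):

sudo fuser -v /dev/nvidia*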

OK, so what you are suggesting does reset the device, but there is no way around closing all IDEs / Python processes. That means I’d still have to reload my data every time, so it doesn’t solve my problem; it seems there is simply no workaround.

Any update on this problem? Is it Ubuntu-specific, or does it persist in other Linux distros?

It’s quite a major hindrance if you are used to a workflow where you do a training run and then need to investigate / extract results from different variables, basically continuing to work in a single process for a couple of hours.

I don’t understand why your use case would require suspending the machine.
As far as I know it’s a known Linux issue, but I’m unsure which distros are affected.

Same issue here on Arch.

The rmmod man page says there’s a -f/--force flag for the
ERROR: module nvidia_uvm is in use
error. It’s not recommended, but I tried it anyway; it didn’t solve the issue.
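
Concretely, that meant something like the following (not recommended, since force-unloading a module that is in use can destabilize the kernel):

sudo rmmod -f nvidia_uvm
sudo modprobe nvidia_uvm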