I am running training on a machine with an RTX 3080, Ubuntu 22.04LTS, CUDA 11.6, in a jupyter notebook anaconda environment.
My training data is stored in an external mongo db. When initializing the training, I load a large chunk of this db-dataset to in-memory of my machine, which takes about 20 minutes each time. When I put my machine to sleep mode (“suspend”), afterwards, when trying to resume training, I get this error message and the cuda device is not recognized, anymore:
/home/user/anaconda3/envs/pyt116/lib/python3.10/site-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.) return torch._C._cuda_getDeviceCount() > 0
It seems to be related to a change of the CUDA_VISIBLE_DEVICES flag. The error results in having to reboot the machine and reloading all the data each time, which is a cumbersome process. Thus, can anybody give advice on how to make the system recognize the cuda device, again, after waking the pc?