I am running Ubuntu 22.04 with an RTX 4090 GPU, 2x Xeon E5-2696 v4 and 256 GB of RAM.
Everything works fine until I suspend my PC. As soon as I resume from suspend, I get this error:
import torch
print(torch.__version__)
torch.cuda.is_available()
2.2.1
/opt/anaconda3/envs/pytorch121/lib/python3.12/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/conda/conda-bld/pytorch_1708025845206/work/c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
I just re-installed my drivers:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:81:00.0 On | Off |
| 0% 46C P8 17W / 450W | 422MiB / 24564MiB | 1% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 5126 G /usr/lib/xorg/Xorg 148MiB |
| 0 N/A N/A 5283 G /usr/bin/gnome-shell 38MiB |
| 0 N/A N/A 8111 G firefox 163MiB |
| 0 N/A N/A 13488 G ...ures=SpareRendererForSitePerProcess 33MiB |
| 0 N/A N/A 13675 G ...ures=SpareRendererForSitePerProcess 16MiB |
+-----------------------------------------------------------------------------------------+
I also created a clean environment for the newest version of PyTorch:
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
But updating the drivers and creating a new environment did not help. The problem already occurred with driver 545.23.08 and PyTorch 2.0.1, and it reproduces in exactly the same way: after a clean reboot everything works fine, but as soon as I suspend my PC and then resume it, I get the error above.
Restarting the Jupyter kernel (in which the process runs) doesn't help. The only thing that helps is a full reboot of the PC, which is what I want to avoid.
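To rule out anything specific to the Jupyter kernel, the same check can be run in a brand-new Python interpreter. Below is a minimal sketch of how that could be done; the helper name `cuda_check_in_fresh_process` and the guarded import are illustrative assumptions, not part of PyTorch.

```python
import subprocess
import sys

# Snippet executed in a brand-new interpreter, so no CUDA state is inherited
# from the Jupyter kernel. The torch import is guarded so the helper still
# reports something useful if torch is missing from that interpreter.
_SNIPPET = """
try:
    import torch
except ImportError:
    print("torch-not-installed")
else:
    print(torch.__version__)
    print(torch.cuda.is_available())
"""


def cuda_check_in_fresh_process(python=sys.executable, timeout=120):
    """Run the CUDA availability check in a separate Python process.

    Hypothetical helper (not a PyTorch API). Returns the exit code, the
    stdout lines, and any warnings the child wrote to stderr.
    """
    proc = subprocess.run(
        [python, "-c", _SNIPPET],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return {
        "returncode": proc.returncode,
        "stdout": proc.stdout.strip().splitlines(),
        "stderr": proc.stderr.strip(),
    }


if __name__ == "__main__":
    result = cuda_check_in_fresh_process()
    print(result["stdout"])
    if result["stderr"]:
        # The "CUDA unknown error" UserWarning, if any, would show up here.
        print(result["stderr"])
```

If the fresh process also reports the CUDA initialization warning after resume, that would suggest the state that breaks lives outside the Python process (driver or kernel side) rather than in the notebook kernel.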
Could anyone give me a hint about what I can do to prevent PyTorch from behaving like this after the PC resumes?