"CUDA error: unspecified launch failure" after PC suspend

Ihor_Bobak · March 9, 2024, 1:57pm

I have Ubuntu 22.04, an RTX 4090 GPU, and 2x Xeon E5-2696 v4 with 256GB RAM.

Everything works fine until I suspend my PC. But as soon as I resume from suspend, I get this error:

import torch
print(torch.__version__)
torch.cuda.is_available()

2.2.1

/opt/anaconda3/envs/pytorch121/lib/python3.12/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/conda/conda-bld/pytorch_1708025845206/work/c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0

I just re-installed my drivers:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:81:00.0  On |                  Off |
|  0%   46C    P8             17W /  450W |     422MiB /  24564MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      5126      G   /usr/lib/xorg/Xorg                            148MiB |
|    0   N/A  N/A      5283      G   /usr/bin/gnome-shell                           38MiB |
|    0   N/A  N/A      8111      G   firefox                                       163MiB |
|    0   N/A  N/A     13488      G   ...ures=SpareRendererForSitePerProcess         33MiB |
|    0   N/A  N/A     13675      G   ...ures=SpareRendererForSitePerProcess         16MiB |
+-----------------------------------------------------------------------------------------+

And I created a clean environment for the newest version of PyTorch:

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

But updating the drivers and creating a new environment did not help. The problem previously occurred with driver 545.23.08 and PyTorch 2.0.1, and it reproduces in exactly the same way: after a clean reboot everything works fine, but as soon as I suspend my PC and then resume it, I get the error above.

Restarting the Jupyter kernel (in which the process runs) doesn’t help. The only thing that helps is a full reboot of the PC, which is what I wanted to avoid.

Could anyone give me a hint about what I can do to prevent PyTorch from behaving like this after the PC resumes?
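
Since restarting the Jupyter kernel does not help, it may be useful to confirm that the failure also appears in a brand-new Python process, which would point at driver state rather than at anything cached in one interpreter. The following is a minimal sketch under that assumption; `run_in_fresh_process` and `CUDA_CHECK` are hypothetical names introduced for illustration and are not part of PyTorch.

```python
import subprocess
import sys


def run_in_fresh_process(code: str, timeout: float = 120.0) -> tuple[int, str, str]:
    """Run a snippet in a new Python interpreter (hypothetical helper).

    Nothing from the current process (for example a Jupyter kernel that
    already tried to initialize CUDA) is reused by the child process.
    Returns (exit code, stdout, stderr).
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout.strip(), result.stderr.strip()


# The same check as in the post, written as a one-liner for the child process.
CUDA_CHECK = "import torch; print(torch.__version__, torch.cuda.is_available())"

if __name__ == "__main__":
    rc, out, err = run_in_fresh_process(CUDA_CHECK)
    print("exit code:", rc)
    print("stdout:", out)
    print("stderr:", err)  # the UserWarning about CUDA initialization would land here
```

If the fresh process reports the same CUDA initialization warning after resume, the problem is unlikely to be specific to the notebook session.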

ptrblck · March 9, 2024, 2:35pm

This post might help.

Ihor_Bobak · March 12, 2024, 4:01am

None of that advice works. The command “sudo rmmod nvidia_uvm” gives me “rmmod: ERROR: Module nvidia_uvm is in use”. It has always given this output, and I checked again: this time as well.

lsmod | grep nvidia

gives me

nvidia_uvm           4919296  4
nvidia_drm            110592  12
nvidia_modeset       1355776  15 nvidia_drm
nvidia              54099968  307 nvidia_uvm,nvidia_modeset
drm_kms_helper        258048  1 nvidia_drm
video                  69632  1 nvidia_modeset
drm                   708608  16 drm_kms_helper,nvidia,nvidia_drm

And what’s next? It is totally unclear what I should do next. Rebooting will solve the problem, but it is annoying. However, I still don’t see any other option.
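
The third column of the `lsmod` output above is the module's use count, and `rmmod` refuses to unload a module while that count is nonzero, which matches the "is in use" error. A count with no module names after it (as for `nvidia_uvm`) suggests the users are something other than kernel modules, most likely processes holding the device open, though that is an assumption here. The sketch below only parses `lsmod`-style text to make those counts explicit; `parse_lsmod` is a hypothetical helper, and `SAMPLE` is copied from the output in this post.

```python
def parse_lsmod(text: str) -> dict[str, tuple[int, list[str]]]:
    """Map module name -> (use count, modules listed as users).

    Hypothetical helper: expects lines shaped like `lsmod` output,
    i.e. "<name> <size> <use count> [user1,user2,...]".
    """
    modules: dict[str, tuple[int, list[str]]] = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines and the "Module  Size  Used by" header.
        if len(parts) < 3 or not parts[1].isdigit() or not parts[2].isdigit():
            continue
        users = [u for u in parts[3].split(",") if u] if len(parts) > 3 else []
        modules[parts[0]] = (int(parts[2]), users)
    return modules


# Output taken from the post above.
SAMPLE = """\
nvidia_uvm           4919296  4
nvidia_drm            110592  12
nvidia_modeset       1355776  15 nvidia_drm
nvidia              54099968  307 nvidia_uvm,nvidia_modeset
"""

if __name__ == "__main__":
    for name, (count, users) in parse_lsmod(SAMPLE).items():
        held_by = ", ".join(users) if users else "no kernel modules listed"
        print(f"{name}: use count {count}, used by: {held_by}")
```

Here `nvidia_uvm` has a use count of 4 and no module users listed, so unloading it would first require finding and stopping whatever still holds it.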