CUDA device unavailable in pytorch, seemingly randomly

Hi folks,

I’m running some PyTorch code on a few spare workstations at work. I’m hitting an issue where workstations report no CUDA device, seemingly at random, and then recover after a period of time. Out of a pool of roughly 30 workstations, around 10 are affected at any one time. From day to day (or hour to hour), different workstations are affected.

At first I thought this was due to the display sleeping, but today I have observed it when a user is logged in and using the desktop environment.

From searching, it seems this can happen when drivers are updated but the machine is not rebooted - this is not happening in my case.

Kind of baffled, any help appreciated. I made an effort to find a solution in previous posts, but none of the answers seemed to apply to me.

Could it be a mismatch between the CUDA version installed with the driver (11.2) and the version I’m using in my container (11.0)?

cheers

>>> torch.cuda.is_available()
/opt/conda/envs/pytorch/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at  /opt/conda/conda-bld/pytorch_1607370156314/work/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
Singularity> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
Singularity> python
Python 3.7.10 (default, Feb 26 2021, 18:47:35) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.7.1'
>>> torch.version.cuda
'11.0'
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.67       Driver Version: 460.67       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro M5000        Off  | 00000000:03:00.0  On |                  Off |
| 43%   49C    P0    48W / 150W |    649MiB /  8126MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+                                                           
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     17617      G   /usr/bin/X                        521MiB |
|    0   N/A  N/A     52214      G   ...AAAAAAAAA= --shared-files      118MiB |
+-----------------------------------------------------------------------------+
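For reference on the version question above, here is a minimal sketch of one way to compare the CUDA runtime the PyTorch build ships with against the version the driver reports. It just parses the nvidia-smi banner shown above, so the regex is an assumption about that banner format:

# Minimal sketch: compare the CUDA runtime this PyTorch build was compiled
# against with the CUDA version the driver advertises (parsed from the
# nvidia-smi banner above; the regex assumes that banner format).
import re
import subprocess

import torch

smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
match = re.search(r"CUDA Version:\s*([\d.]+)", smi)
driver_cuda = match.group(1) if match else "unknown"

print("torch build CUDA runtime:", torch.version.cuda)       # 11.0 in the container
print("driver-reported CUDA:    ", driver_cuda)               # 11.2 on the host
print("cuda available:          ", torch.cuda.is_available())

As far as I understand, a driver that reports a newer CUDA version than the toolkit inside the container should be backwards compatible, but I’d like to rule it out.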

I doubt the issue is caused by the local CUDA toolkit installation; this kind of error is usually caused by a misconfiguration of the system. Since you’ve already ruled out driver updates and the issue appears and disappears at random, it would be interesting to know what “state” the systems are in when the error happens, i.e. are these systems in any kind of hibernation/suspend state, etc.?
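One way to capture that state might be a small periodic logger on each workstation, so that failures can be correlated with timestamps, user sessions, suspend events, and so on. A rough sketch (the log path and the scheduling, e.g. via cron, are just placeholders):

# Rough sketch: run periodically (e.g. from cron) on each workstation and
# correlate the timestamps of failures with user sessions, suspend events,
# driver logs, etc. The log path is just a placeholder.
import datetime
import subprocess

import torch

stamp = datetime.datetime.now().isoformat(timespec="seconds")
ok = torch.cuda.is_available()

with open("/var/tmp/cuda_watch.log", "a") as f:
    f.write(f"{stamp} cuda_available={ok}\n")
    if not ok:
        # Capture the driver's view of the GPU at failure time.
        smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
        f.write(smi.stdout + smi.stderr + "\n")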

Cheers. Unfortunately, we haven’t been able to find any sort of pattern.

We have taken PyTorch itself out of the equation by reproducing the CUDA availability issue in TensorFlow and also in Blender. (I put a forum post over at the NVIDIA forums after making this discovery.)

I can confirm the machines are not in any hibernation or suspend state. We’ve seen this error happen even when a user is logged into a GNOME session and using OpenGL/OpenCL software such as Autodesk Maya (showing up as type C+G in nvidia-smi). Rebooting the affected machine seems to fix it.

It’s very weird. Graphics work fine but CUDA intermittently drops out.
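Next time a machine is in the bad state, I’ll try to grab a quick snapshot of the NVIDIA device nodes and kernel modules while graphics is still working, something like the rough sketch below (these are just the standard Linux driver device nodes, nothing specific to our setup):

# Quick snapshot of driver state on an affected machine: which /dev/nvidia*
# nodes exist and which nvidia kernel modules are loaded, to compare a
# "bad" machine against a healthy one.
import glob
import subprocess

print("device nodes:", sorted(glob.glob("/dev/nvidia*")))

lsmod = subprocess.run(["lsmod"], capture_output=True, text=True).stdout
nvidia_modules = [line.split()[0] for line in lsmod.splitlines() if line.startswith("nvidia")]
print("loaded nvidia modules:", nvidia_modules)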