Hi folks,
I’m running some PyTorch code on a few spare workstations at work. I’m hitting an issue where workstations intermittently report no CUDA device, seemingly at random, and then recover on their own after a while. Out of a pool of about 30 workstations, around 10 are affected at any one time, and the affected set changes from day to day (or hour to hour).
At first I thought this was due to the display sleeping, but today I have observed it when a user is logged in and using the desktop environment.
From searching, it seems this can happen when drivers are updated but the machine is not rebooted, but that is not what is happening in my case.
Kind of baffled, any help appreciated. I made an effort to find a solution in previous posts, but none of the answers seemed to apply to me.
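In case it helps narrow down when machines drop out, here is a minimal sketch of a probe I could run on each workstation (e.g. from cron). It checks CUDA in a fresh interpreter each time, on the assumption that a failed CUDA initialization stays cached for the life of a Python process. The names `PROBE`, `cuda_visible` and `log_line` are made up for illustration.

```python
import socket
import subprocess
import sys
import time

# Assumption: torch is importable by the interpreter running this script.
PROBE = "import torch; print(torch.cuda.is_available())"


def cuda_visible(probe=PROBE, timeout=120):
    """Run the probe in a new Python process; return (ok, stderr text)."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", probe],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "probe timed out"
    # Only a clean exit that printed exactly "True" counts as success.
    ok = result.returncode == 0 and result.stdout.strip() == "True"
    return ok, result.stderr.strip()


def log_line(probe=PROBE):
    """Build one timestamped line tagged with the hostname."""
    ok, detail = cuda_visible(probe)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    return f"{stamp} {socket.gethostname()} cuda_available={ok} {detail}".rstrip()
```

Appending `log_line()` to a shared file every few minutes should show whether the failures line up with anything (time of day, logins, display sleep) across the pool.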
Could it be a mismatch between the CUDA version reported by the driver (11.2) and the version I’m using in my container (11.0)?
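To make that question concrete, here is a small sketch of the comparison I have in mind. It assumes the usual rule, as I understand it, that the CUDA version shown by nvidia-smi is the newest runtime the driver supports, so an older runtime in the container should be fine. The helper names are hypothetical.

```python
def parse_version(text):
    """Turn a version string such as '11.2' into a comparable tuple (11, 2)."""
    return tuple(int(part) for part in text.split("."))


def driver_supports_runtime(driver_cuda, runtime_cuda):
    """Assumed rule: the driver's CUDA version must be >= the runtime's."""
    return parse_version(driver_cuda) >= parse_version(runtime_cuda)


# Values from my setup: driver reports 11.2, container uses 11.0.
print(driver_supports_runtime("11.2", "11.0"))  # True
```

If that rule holds, 11.2 on the host with 11.0 in the container should not be the problem by itself, which would also fit with the same machines working at other times.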
cheers
>>> torch.cuda.is_available()
/opt/conda/envs/pytorch/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/conda/conda-bld/pytorch_1607370156314/work/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
Singularity> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
Singularity> python
Python 3.7.10 (default, Feb 26 2021, 18:47:35)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.7.1'
>>> torch.version.cuda
'11.0'
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.67       Driver Version: 460.67       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro M5000        Off  | 00000000:03:00.0  On |                  Off |
| 43%   49C    P0    48W / 150W |    649MiB /  8126MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     17617      G   /usr/bin/X                        521MiB |
|    0   N/A  N/A     52214      G   ...AAAAAAAAA= --shared-files      118MiB |
+-----------------------------------------------------------------------------+