PyTorch torch.cuda.is_available() returning False all of a sudden

Good day all,

I know there have been many answers to similar questions, but I haven't found a solution. torch.cuda.is_available() is returning False, yet oddly it used to return True with the same configuration. The code is running on a GPU cluster. I was wondering if anyone wiser than me can spot anything wrong with the following configuration:
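For reference, the check boils down to something like this (a minimal sketch, not my full script):

import torch

print(torch.cuda.is_available())  # now prints False, used to print True
print(torch.__version__)          # 1.4.0
print(torch.version.cuda)         # 10.0, the CUDA version this build was compiled against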

PyTorch version: 1.4.0
Is debug build: False
CUDA used to build PyTorch: 10.0
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.9.0
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.6 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: Tesla V100-PCIE-16GB
GPU 1: Tesla V100-PCIE-16GB
GPU 2: Tesla V100-PCIE-16GB

Nvidia driver version: 418.40.04
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] numpydoc==1.1.0
[pip3] torch==1.4.0
[pip3] torchvision==0.5.0
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               10.0.130                      0
[conda] mkl                       2020.1                      217
[conda] mkl-service               2.3.0            py36he904b0f_0
[conda] mkl_fft                   1.1.0            py36h23d657b_0
[conda] mkl_random                1.1.1            py36h0573a6f_0
[conda] numpy                     1.18.5           py36ha1c710e_0
[conda] numpy-base                1.18.5           py36hde5b4d6_0
[conda] numpydoc                  1.1.0                      py_0
[conda] pytorch                   1.4.0           py3.6_cuda10.0.130_cudnn7.6.3_0    pytorch
[conda] torchvision               0.5.0                py36_cu100    pytorch

Here is the nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                  Off |
| N/A   31C    P0    25W / 250W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:AF:00.0 Off |                  Off |
| N/A   32C    P0    27W / 250W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 00000000:D8:00.0 Off |                  Off |
| N/A   30C    P0    22W / 250W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

I appreciate any assistance. Thank you!

Are other CUDA programs working on this server?
If not, did you change anything on this machine, e.g. did you upgrade some drivers without restarting the node?


It is possible that something changed without my knowledge, since it is not my server. However, the nvidia-smi output is still what I posted above. Perhaps I should try reinstalling PyTorch? Is there another CUDA test I could run?

The only other thing I tried was a different conda environment with the latest PyTorch version, 1.6.0 with CUDA 10.1. Unfortunately, this latest version had the same problem.

After a driver change, for example, the nvidia-smi output might still look valid, but CUDA programs might not work properly, so you should restart the system.
As a test, you could compile any CUDA sample and run it. Since other conda envs are also not working, I assume someone might indeed have changed some drivers or runtimes on the server.
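From Python you could also force CUDA initialization directly instead of only checking the flag, which often surfaces the actual error message. A rough sketch (allocating a tensor on the GPU is just one way to trigger initialization):

import torch

# Creating a tensor on the GPU forces CUDA initialization and should raise
# an exception containing the real failure reason (e.g. a driver/runtime
# mismatch) if something on the node is broken.
try:
    x = torch.zeros(1, device="cuda")
    print("CUDA is working on", torch.cuda.get_device_name(0))
except Exception as e:
    print("CUDA initialization failed:", e)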


Thanks @ptrblck, I’ll do some investigation.

It seems it was definitely a problem on the cluster's side. They resolved it after I queried them. Cheers!