PyTorch torch.cuda.is_available() returning False all of a sudden

Good day all,

I know there have been many answers to similar questions, but I haven't found a solution. torch.cuda.is_available() is returning False, yet oddly it used to return True with the same configuration. The code is running on a GPU cluster. I was wondering if anyone wiser than me can spot anything wrong with the following configuration:
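For reference, the check boils down to something like this (a minimal sketch, not my full script):

import torch

print(torch.cuda.is_available())  # now prints False, used to print True
print(torch.__version__)          # 1.4.0
print(torch.version.cuda)         # 10.0, the CUDA version this build was compiled against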

PyTorch version: 1.4.0
Is debug build: False
CUDA used to build PyTorch: 10.0
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.9.0
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.6 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: Tesla V100-PCIE-16GB
GPU 1: Tesla V100-PCIE-16GB
GPU 2: Tesla V100-PCIE-16GB

Nvidia driver version: 418.40.04
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] numpydoc==1.1.0
[pip3] torch==1.4.0
[pip3] torchvision==0.5.0
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               10.0.130                      0
[conda] mkl                       2020.1                      217
[conda] mkl-service               2.3.0            py36he904b0f_0
[conda] mkl_fft                   1.1.0            py36h23d657b_0
[conda] mkl_random                1.1.1            py36h0573a6f_0
[conda] numpy                     1.18.5           py36ha1c710e_0
[conda] numpy-base                1.18.5           py36hde5b4d6_0
[conda] numpydoc                  1.1.0                      py_0
[conda] pytorch                   1.4.0           py3.6_cuda10.0.130_cudnn7.6.3_0    pytorch
[conda] torchvision               0.5.0                py36_cu100    pytorch

Here is the nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                  Off |
| N/A   31C    P0    25W / 250W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:AF:00.0 Off |                  Off |
| N/A   32C    P0    27W / 250W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 00000000:D8:00.0 Off |                  Off |
| N/A   30C    P0    22W / 250W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

I appreciate any assistance. Thank you!

Are other CUDA programs working on this server?
If not, did you change anything on this machine, e.g. did you upgrade some drivers without restarting the node?


It is possible that something changed without my knowledge, since it is not my server. However, the nvidia-smi output is still what I posted above. Perhaps I should try reinstalling PyTorch? Is there another CUDA test I could run?

The only other thing I tried was a different conda environment with the latest PyTorch version, 1.6.0 with CUDA 10.1. Unfortunately, this latest version had the same problem.

After a driver change, for example, the nvidia-smi output might still look valid, but CUDA programs might not work properly, so you should restart the system.
As a test, you could compile any CUDA sample and run it. Since other conda envs are also not working, I assume someone might indeed have changed some drivers or runtimes on the server.
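From Python you could also force CUDA initialization directly instead of only checking the flag, which often surfaces the actual error message. A rough sketch (allocating a tensor on the GPU is just one way to trigger initialization):

import torch

# Creating a tensor on the GPU forces CUDA initialization and should raise
# an exception containing the real failure reason (e.g. a driver/runtime
# mismatch) if something on the node is broken.
try:
    x = torch.zeros(1, device="cuda")
    print("CUDA is working on", torch.cuda.get_device_name(0))
except Exception as e:
    print("CUDA initialization failed:", e)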


Thanks @ptrblck, I’ll do some investigation.

It seems it was definitely a problem on the cluster's side. They resolved it after I queried them. Cheers!