torch.cuda.is_available() kept switching to False

Hi team!

torch.cuda.is_available() kept switching to False on me. Can someone please shed some light on how to debug this? My setup:

Nvidia Driver version: 510.108.03
CUDA: 11.6
Pytorch: 1.12.1+cu116

I installed PyTorch through pip:
pip install torch==1.12.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
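For reference, the CUDA build the installed wheel was compiled against can be confirmed via torch.version.cuda; a minimal sketch (the import guard is only so the snippet also runs where torch is absent, and the helper name is illustrative):

```python
import importlib.util

def installed_cuda_build():
    """Report which CUDA toolkit the installed torch wheel was built
    against (e.g. '11.6' for a +cu116 wheel), or None if torch is not
    importable in this environment."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch
    return torch.version.cuda

print(installed_cuda_build())
```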

I also ran python -m torch.utils.collect_env. The output when torch.cuda.is_available() returned True:

Collecting environment information...
PyTorch version: 1.12.1+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.7.13 (default, Mar 29 2022, 02:18:16)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.13.0-1031-aws-x86_64-with-debian-bullseye-sid
Is CUDA available: True
CUDA runtime version: 11.6.124
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.4.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] torch==1.12.1+cu116
[conda] numpy                     1.21.6                   pypi_0    pypi
[conda] torch                     1.12.1+cu116             pypi_0    pypi

The output when torch.cuda.is_available() returned False was identical, except for this line:

Is CUDA available: False

Behavior observed:
torch.cuda.is_available() returns True for a while, then switches to False. I am running this in a Kubernetes cluster, and I had to bounce the pod to get torch.cuda.is_available() to return True again. I am using AWS g4 instances with NVIDIA T4 GPUs.
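To pin down exactly when availability flips, one option is a small probe that polls and timestamps each check so the transition shows up in the pod logs; a rough sketch (the import guard and the watch helper are illustrative, not part of any PyTorch API):

```python
import datetime
import importlib.util
import time

def cuda_probe():
    """Return the current torch.cuda.is_available() result, or None
    when torch itself cannot be imported in this environment."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch
    return torch.cuda.is_available()

def watch(interval_s=60, iterations=3):
    """Poll cuda_probe() and timestamp each result, so the exact
    moment availability flips is visible in the logs."""
    history = []
    for _ in range(iterations):
        history.append((datetime.datetime.now().isoformat(), cuda_probe()))
        time.sleep(interval_s)
    return history

for stamp, available in watch(interval_s=0):
    print(stamp, available)
```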

Thanks in advance!

Are you observing different behavior on different nodes with the same machine image? In that case I would check whether there is something unusual on the machine, e.g. via nvidia-smi.

I am not observing different behavior on different nodes. torch.cuda.is_available() switches to False on the same node, and I was able to reproduce this on different nodes with the same machine image.

We don’t have nvidia-smi installed. Would its absence matter for PyTorch to detect the GPU? I see the following from nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0

nvidia-smi is not a requirement for using PyTorch, but it can be useful to diagnose driver or hardware issues if those turn out to be the root cause.
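Even without nvidia-smi installed, driver visibility can be checked from inside the pod via the device nodes and proc entry the CUDA runtime relies on; a heuristic sketch (the paths are the standard Linux NVIDIA driver locations, and both helper names are hypothetical):

```python
import shutil
import subprocess
from pathlib import Path

def driver_visible():
    """Heuristic check of NVIDIA driver visibility inside a container,
    useful when nvidia-smi is not installed. If these disappear while
    the pod runs, the driver (not PyTorch) is the likely culprit."""
    return {
        "proc_driver": Path("/proc/driver/nvidia/version").exists(),
        "dev_nvidia0": Path("/dev/nvidia0").exists(),
        "dev_nvidiactl": Path("/dev/nvidiactl").exists(),
        "dev_nvidia_uvm": Path("/dev/nvidia-uvm").exists(),
    }

def smi_summary():
    """Run nvidia-smi if the binary exists; return a one-line-per-GPU
    summary, or None when it is not installed or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version",
         "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    return result.stdout.strip() if result.returncode == 0 else None

print(driver_visible())
print(smi_summary())
```

If the device-node checks flip from True to False at the same time torch.cuda.is_available() does, that points at the node or container runtime rather than PyTorch.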