PyTorch does not see CUDA

Virtualization: microsoft AWS server
Operating System: Ubuntu 18.04.6 LTS
Architecture: x86-64
nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000001:00:00.0 Off |                    0 |
| N/A    1C    P0   ERR! / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
dpkg -l | grep nvidia

ii  libnvidia-cfg1-470:amd64               470.103.01-0ubuntu1                       amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-common-470                   470.103.01-0ubuntu1                       all          Shared files used by the NVIDIA libraries
ii  libnvidia-compute-470:amd64            470.103.01-0ubuntu1                       amd64        NVIDIA libcompute package
ii  libnvidia-decode-470:amd64             470.103.01-0ubuntu1                       amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-encode-470:amd64             470.103.01-0ubuntu1                       amd64        NVENC Video Encoding runtime library
ii  libnvidia-extra-470:amd64              470.103.01-0ubuntu1                       amd64        Extra libraries for the NVIDIA driver
ii  libnvidia-fbc1-470:amd64               470.103.01-0ubuntu1                       amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-gl-470:amd64                 470.103.01-0ubuntu1                       amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  libnvidia-ifr1-470:amd64               470.103.01-0ubuntu1                       amd64        NVIDIA OpenGL-based Inband Frame Readback runtime library
ii  nvidia-compute-utils-470               470.103.01-0ubuntu1                       amd64        NVIDIA compute utilities
ii  nvidia-dkms-470                        470.103.01-0ubuntu1                       amd64        NVIDIA DKMS package
ii  nvidia-driver-470                      470.103.01-0ubuntu1                       amd64        NVIDIA driver metapackage
ii  nvidia-kernel-common-470               470.103.01-0ubuntu1                       amd64        Shared files used with the kernel module
ii  nvidia-kernel-source-470               470.103.01-0ubuntu1                       amd64        NVIDIA kernel source package
ii  nvidia-modprobe                        510.47.03-0ubuntu1                        amd64        Load the NVIDIA kernel driver and create device files
ii  nvidia-settings                        510.47.03-0ubuntu1                        amd64        Tool for configuring the NVIDIA graphics driver
ii  nvidia-utils-470                       470.103.01-0ubuntu1                       amd64        NVIDIA driver support binaries
ii  xserver-xorg-video-nvidia-470          470.103.01-0ubuntu1                       amd64        NVIDIA binary Xorg driver
dpkg -l | grep cuda

ii  cuda-command-line-tools-11-1           11.1.1-1                                  amd64        CUDA command-line tools
ii  cuda-compiler-11-1                     11.1.1-1                                  amd64        CUDA compiler
ii  cuda-cudart-11-1                       11.1.74-1                                 amd64        CUDA Runtime native Libraries
ii  cuda-cudart-dev-11-1                   11.1.74-1                                 amd64        CUDA Runtime native dev links, headers
ii  cuda-cuobjdump-11-1                    11.1.74-1                                 amd64        CUDA cuobjdump
ii  cuda-cupti-11-1                        11.1.105-1                                amd64        CUDA profiling tools runtime libs.
ii  cuda-cupti-dev-11-1                    11.1.105-1                                amd64        CUDA profiling tools interface.
ii  cuda-documentation-11-1                11.1.105-1                                amd64        CUDA documentation
ii  cuda-driver-dev-11-1                   11.1.74-1                                 amd64        CUDA Driver native dev stub library
ii  cuda-drivers-470                       470.103.01-1                              amd64        CUDA Driver meta-package, branch-specific
ii  cuda-gdb-11-1                          11.1.105-1                                amd64        CUDA-GDB
ii  cuda-libraries-11-1                    11.1.1-1                                  amd64        CUDA Libraries 11.1 meta-package
ii  cuda-libraries-dev-11-1                11.1.1-1                                  amd64        CUDA Libraries 11.1 development meta-package
ii  cuda-memcheck-11-1                     11.1.105-1                                amd64        CUDA-MEMCHECK
ii  cuda-nsight-11-1                       11.1.105-1                                amd64        CUDA nsight
ii  cuda-nsight-compute-11-1               11.1.1-1                                  amd64        NVIDIA Nsight Compute
ii  cuda-nsight-systems-11-1               11.1.1-1                                  amd64        NVIDIA Nsight Systems
ii  cuda-nvcc-11-1                         11.1.105-1                                amd64        CUDA nvcc
ii  cuda-nvdisasm-11-1                     11.1.74-1                                 amd64        CUDA disassembler
ii  cuda-nvml-dev-11-1                     11.1.74-1                                 amd64        NVML native dev links, headers
ii  cuda-nvprof-11-1                       11.1.105-1                                amd64        CUDA Profiler tools
ii  cuda-nvprune-11-1                      11.1.74-1                                 amd64        CUDA nvprune
ii  cuda-nvrtc-11-1                        11.1.105-1                                amd64        NVRTC native runtime libraries
ii  cuda-nvrtc-dev-11-1                    11.1.105-1                                amd64        NVRTC native dev links, headers
ii  cuda-nvtx-11-1                         11.1.74-1                                 amd64        NVIDIA Tools Extension
ii  cuda-nvvp-11-1                         11.1.105-1                                amd64        CUDA Profiler tools
ii  cuda-samples-11-1                      11.1.105-1                                amd64        CUDA example applications
ii  cuda-sanitizer-11-1                    11.1.105-1                                amd64        CUDA Sanitizer
ii  cuda-toolkit-11-1                      11.1.1-1                                  amd64        CUDA Toolkit 11.1 meta-package
ii  cuda-tools-11-1                        11.1.1-1                                  amd64        CUDA Tools meta-package
ii  cuda-visual-tools-11-1                 11.1.1-1                                  amd64        CUDA visual tools
>>> import torch
>>> torch.version.cuda
'11.1'
>>> torch.cuda.is_available()
/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:80: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:112.)
  return torch._C._cuda_getDeviceCount() > 0
False
>>> print(torch.__version__)
1.9.1+cu111

I tried installing torch with cu111 and cu113, but neither works. I also tried to install cu114, but there are no such packages.
I have two questions:

  • Why does ‘nvidia-smi’ show CUDA version 11.4, while ‘dpkg -l’ shows CUDA version 11.1?
  • What do I need to do so that PyTorch sees the GPU?
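
A general note on the first question, not specific to this machine: the "CUDA Version" field in the nvidia-smi header is the highest CUDA version the installed driver supports, whereas dpkg and nvcc report the locally installed CUDA toolkit. The two do not have to match, and the PyTorch pip and conda binaries ship their own CUDA runtime (reported by torch.version.cuda) in any case. Below is a minimal sketch that pulls both numbers out of the header so they can be compared with the toolkit version. The helper names parse_smi_header and version_tuple are made up for this illustration.

```python
import re


def parse_smi_header(text):
    """Return (driver_version, max_supported_cuda) from nvidia-smi output, or None."""
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", text
    )
    if match is None:
        return None
    return match.group(1), match.group(2)


def version_tuple(version):
    """Turn a dotted version string such as '11.4' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


# Header line copied from the nvidia-smi output above.
header = "| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |"
driver, max_cuda = parse_smi_header(header)

# Toolkit release reported by nvcc / dpkg in the same post.
toolkit = "11.1"
print(driver, max_cuda, version_tuple(toolkit) <= version_tuple(max_cuda))
```

Under that reading, a local 11.1 toolkit next to a driver that supports up to 11.4 is not a mismatch in itself.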

Your setup seems to use a kernel-mode driver that differs from the user-mode driver (most likely installed via the CUDA toolkit).

Did you “hide” the GPU, e.g. via CUDA_VISIBLE_DEVICES?
If not, was this setup ever working, or did you just install it (and perhaps forget to reboot the node)?
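
A quick way to rule out the first possibility is to look at the variable from inside the same Python process that imports torch. This is only a sketch: describe_visible_devices is a hypothetical helper, and treating an empty value or -1 as "hide everything" reflects how the variable is commonly used rather than anything stated in this thread.

```python
import os


def describe_visible_devices(env=None):
    """Summarise what CUDA_VISIBLE_DEVICES means for this process."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return "unset: no GPU is hidden by this variable"
    if value.strip() in ("", "-1"):
        # Assumption: empty or -1 is the usual way to hide every GPU.
        return "set to %r: all GPUs are hidden" % value
    return "set to %r: only these devices are visible" % value


print(describe_visible_devices())
```

If the variable turns out to be set, unsetting it in the shell (or starting Python from a clean environment) before importing torch is the simplest test.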

Hello,
I am observing a similar issue on my machine as well.
Below is the output from my setup.
Could you let me know the best way to install PyTorch locally?

nvidia-smi 
Wed Jul 20 18:00:50 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   51C    P8    12W /  N/A |      5MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3289      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+


dpkg -l | grep nvidia
ii  libnvidia-cfg1-470:amd64                470.129.06-1pop0~1656630197~22.04~c52ca60                         amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-common-470                    470.129.06-1pop0~1656630197~22.04~c52ca60                         all          Shared files used by the NVIDIA libraries
ii  libnvidia-compute-470:amd64             470.129.06-1pop0~1656630197~22.04~c52ca60                         amd64        NVIDIA libcompute package
ii  libnvidia-compute-470:i386              470.129.06-1pop0~1656630197~22.04~c52ca60                         i386         NVIDIA libcompute package
rc  libnvidia-compute-515:amd64             515.48.07-1pop0~1657640780~22.04~e863eed                          amd64        NVIDIA libcompute package
ii  libnvidia-decode-470:amd64              470.129.06-1pop0~1656630197~22.04~c52ca60                         amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-decode-470:i386               470.129.06-1pop0~1656630197~22.04~c52ca60                         i386         NVIDIA Video Decoding runtime libraries
ii  libnvidia-egl-wayland1:amd64            1:1.1.9-1.1                                                       amd64        Wayland EGL External Platform library -- shared library
ii  libnvidia-encode-470:amd64              470.129.06-1pop0~1656630197~22.04~c52ca60                         amd64        NVENC Video Encoding runtime library
ii  libnvidia-encode-470:i386               470.129.06-1pop0~1656630197~22.04~c52ca60                         i386         NVENC Video Encoding runtime library
ii  libnvidia-extra-470:amd64               470.129.06-1pop0~1656630197~22.04~c52ca60                         amd64        Extra libraries for the NVIDIA driver
ii  libnvidia-fbc1-470:amd64                470.129.06-1pop0~1656630197~22.04~c52ca60                         amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-fbc1-470:i386                 470.129.06-1pop0~1656630197~22.04~c52ca60                         i386         NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-gl-470:amd64                  470.129.06-1pop0~1656630197~22.04~c52ca60                         amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  libnvidia-gl-470:i386                   470.129.06-1pop0~1656630197~22.04~c52ca60                         i386         NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  libnvidia-ifr1-470:amd64                470.129.06-1pop0~1656630197~22.04~c52ca60                         amd64        NVIDIA OpenGL-based Inband Frame Readback runtime library
ii  libnvidia-ifr1-470:i386                 470.129.06-1pop0~1656630197~22.04~c52ca60                         i386         NVIDIA OpenGL-based Inband Frame Readback runtime library
ii  nvidia-compute-utils-470                470.129.06-1pop0~1656630197~22.04~c52ca60                         amd64        NVIDIA compute utilities
rc  nvidia-cuda-toolkit                     11.5.1-1ubuntu1                                                   amd64        NVIDIA CUDA development toolkit
ii  nvidia-dkms-470                         470.129.06-1pop0~1656630197~22.04~c52ca60                         amd64        NVIDIA DKMS package
ii  nvidia-driver-450                       470.129.06-1pop0~1656630197~22.04~c52ca60                         amd64        Transitional package for nvidia-driver-470
ii  nvidia-driver-470                       470.129.06-1pop0~1656630197~22.04~c52ca60                         amd64        NVIDIA driver metapackage
ii  nvidia-kernel-common-470                470.129.06-1pop0~1656630197~22.04~c52ca60                         amd64        Shared files used with the kernel module
ii  nvidia-kernel-source-470                470.129.06-1pop0~1656630197~22.04~c52ca60                         amd64        NVIDIA kernel source package
ii  nvidia-settings                         465.19.01-0ubuntu1                                                amd64        Tool for configuring the NVIDIA graphics driver
ii  nvidia-utils-470                        470.129.06-1pop0~1656630197~22.04~c52ca60                         amd64        NVIDIA driver support binaries
ii  screen-resolution-extra                 0.18.2                                                            all          Extension for the nvidia-settings control panel
ii  xserver-xorg-video-nvidia-470           470.129.06-1pop0~1656630197~22.04~c52ca60                         amd64        NVIDIA binary Xorg driver

python environment pytorch libs
pytorch                   1.10.1          py3.7_cuda11.3_cudnn8.2.0_0    pytorch
pytorch-lightning         1.6.5                    pypi_0    pypi
pytorch-mutex             1.0                        cuda    pytorch

You have already installed an old PyTorch release with the CUDA 11.3 runtime. If PyTorch cannot use the GPU, it might have trouble communicating with the driver. Make sure that other CUDA applications can use the GPU and, if they cannot, try reinstalling the NVIDIA driver.
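
One way to test the driver without involving PyTorch at all is to call the CUDA driver API directly. The sketch below loads the driver library through ctypes and calls cuInit and cuDeviceGetCount, both of which return 0 (CUDA_SUCCESS) when they succeed. The library name libcuda.so.1 is the usual one on Linux but is an assumption here, and probe_cuda_driver is a made-up helper name.

```python
import ctypes


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Try to initialise the CUDA driver directly; return (ok, detail)."""
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError as exc:
        return False, "could not load %s: %s" % (lib_name, exc)

    # cuInit takes a flags argument that must be 0; 0 is CUDA_SUCCESS.
    status = libcuda.cuInit(0)
    if status != 0:
        return False, "cuInit failed with CUresult %d" % status

    count = ctypes.c_int(0)
    status = libcuda.cuDeviceGetCount(ctypes.byref(count))
    if status != 0:
        return False, "cuDeviceGetCount failed with CUresult %d" % status
    return True, "driver initialised, %d device(s) found" % count.value


print(probe_cuda_driver())
```

If cuInit already fails here, the problem sits below PyTorch, which would point towards the driver installation rather than the PyTorch binaries.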

Thank you for your response.
The NVIDIA driver route is constrained, as I use System76 Pop!_OS.
This forces me to use the 470 driver through the packages.
I will try installing an older version using the local file route, but I might run into the same wall there.
I will also try running TensorFlow on the GPU and see if that works with the current setup.

Below is my nvcc -V output:

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

I’ve tested a few 4xx drivers (450.119.04 and 470.57.02) and both work with the binaries on Ubuntu, so I’m unsure what’s causing the issue.
Is torch.version.cuda returning the right CUDA runtime version and are you seeing an init error when calling torch.cuda.is_available()?
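
Both pieces of information can be collected in one go. In this sketch, cuda_report is a hypothetical helper: it records the CUDA runtime version bundled with the installed binary and captures any warning raised while checking availability, so the init error (if any) ends up in the report instead of scrolling past.

```python
import warnings


def cuda_report():
    """Collect the PyTorch version, its bundled CUDA runtime, and any init warnings."""
    try:
        import torch
    except ImportError:
        return {"torch": None}

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        available = torch.cuda.is_available()

    return {
        "torch": torch.__version__,
        "bundled_cuda": torch.version.cuda,  # runtime shipped with the binary
        "available": available,
        "warnings": [str(w.message) for w in caught],
    }


print(cuda_report())
```

Posting the whole dictionary makes it easier for others to see whether the failure happens during driver initialisation or somewhere else.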

I am now able to see the GPU.
What fixed it for me was installing the System76 NVIDIA drivers via apt, which installed the correct pairs.
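
For anyone who wants to check whether their own packages come from a single driver branch, here is a rough sketch that groups the fully installed ("ii") rows of dpkg -l | grep nvidia by the major driver version. group_by_driver_branch is a made-up helper, the sample rows are shortened from the first listing in this thread, and more than one branch showing up is only a hint worth looking into, not proof of a broken install.

```python
import re
from collections import defaultdict


def group_by_driver_branch(dpkg_output):
    """Map driver branch (e.g. '470') to the installed NVIDIA package names."""
    branches = defaultdict(list)
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Keep fully installed packages only; "rc" rows are removed packages
        # that merely left configuration files behind.
        if len(fields) < 3 or fields[0] != "ii":
            continue
        name, version = fields[1], fields[2]
        # Driver versions look like 470.103.01; skip anything else.
        match = re.match(r"(\d{3})\.\d+", version)
        if match:
            branches[match.group(1)].append(name)
    return dict(branches)


sample = """\
ii  nvidia-dkms-470     470.103.01-0ubuntu1   amd64  NVIDIA DKMS package
ii  nvidia-utils-470    470.103.01-0ubuntu1   amd64  NVIDIA driver support binaries
ii  nvidia-modprobe     510.47.03-0ubuntu1    amd64  Load the NVIDIA kernel driver and create device files
"""
print(group_by_driver_branch(sample))
```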

TensorFlow sees the GPU:
>>> import tensorflow as tf
>>> print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
2022-07-21 00:26:52.004648: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-21 00:26:52.029046: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-21 00:26:52.029161: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Num GPUs Available:  1

nvidia-smi (latest update)
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   46C    P8    18W /  N/A |     46MiB /  8192MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4831      G   /usr/lib/xorg/Xorg                 45MiB |
+-----------------------------------------------------------------------------+

>>> import torch
>>> print(torch.cuda.is_available())
True
>>> print(torch.version.cuda)
11.3