PyTorch does not see CUDA

Virtualization: microsoft AWS server
Operating System: Ubuntu 18.04.6 LTS
Architecture: x86-64

| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           On   | 00000001:00:00.0 Off |                    0 |
| N/A    1C    P0   ERR! / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|  No running processes found                                                 |
nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
dpkg -l | grep nvidia

ii  libnvidia-cfg1-470:amd64               470.103.01-0ubuntu1                       amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-common-470                   470.103.01-0ubuntu1                       all          Shared files used by the NVIDIA libraries
ii  libnvidia-compute-470:amd64            470.103.01-0ubuntu1                       amd64        NVIDIA libcompute package
ii  libnvidia-decode-470:amd64             470.103.01-0ubuntu1                       amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-encode-470:amd64             470.103.01-0ubuntu1                       amd64        NVENC Video Encoding runtime library
ii  libnvidia-extra-470:amd64              470.103.01-0ubuntu1                       amd64        Extra libraries for the NVIDIA driver
ii  libnvidia-fbc1-470:amd64               470.103.01-0ubuntu1                       amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-gl-470:amd64                 470.103.01-0ubuntu1                       amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  libnvidia-ifr1-470:amd64               470.103.01-0ubuntu1                       amd64        NVIDIA OpenGL-based Inband Frame Readback runtime library
ii  nvidia-compute-utils-470               470.103.01-0ubuntu1                       amd64        NVIDIA compute utilities
ii  nvidia-dkms-470                        470.103.01-0ubuntu1                       amd64        NVIDIA DKMS package
ii  nvidia-driver-470                      470.103.01-0ubuntu1                       amd64        NVIDIA driver metapackage
ii  nvidia-kernel-common-470               470.103.01-0ubuntu1                       amd64        Shared files used with the kernel module
ii  nvidia-kernel-source-470               470.103.01-0ubuntu1                       amd64        NVIDIA kernel source package
ii  nvidia-modprobe                        510.47.03-0ubuntu1                        amd64        Load the NVIDIA kernel driver and create device files
ii  nvidia-settings                        510.47.03-0ubuntu1                        amd64        Tool for configuring the NVIDIA graphics driver
ii  nvidia-utils-470                       470.103.01-0ubuntu1                       amd64        NVIDIA driver support binaries
ii  xserver-xorg-video-nvidia-470          470.103.01-0ubuntu1                       amd64        NVIDIA binary Xorg driver
dpkg -l | grep cuda

ii  cuda-command-line-tools-11-1           11.1.1-1                                  amd64        CUDA command-line tools
ii  cuda-compiler-11-1                     11.1.1-1                                  amd64        CUDA compiler
ii  cuda-cudart-11-1                       11.1.74-1                                 amd64        CUDA Runtime native Libraries
ii  cuda-cudart-dev-11-1                   11.1.74-1                                 amd64        CUDA Runtime native dev links, headers
ii  cuda-cuobjdump-11-1                    11.1.74-1                                 amd64        CUDA cuobjdump
ii  cuda-cupti-11-1                        11.1.105-1                                amd64        CUDA profiling tools runtime libs.
ii  cuda-cupti-dev-11-1                    11.1.105-1                                amd64        CUDA profiling tools interface.
ii  cuda-documentation-11-1                11.1.105-1                                amd64        CUDA documentation
ii  cuda-driver-dev-11-1                   11.1.74-1                                 amd64        CUDA Driver native dev stub library
ii  cuda-drivers-470                       470.103.01-1                              amd64        CUDA Driver meta-package, branch-specific
ii  cuda-gdb-11-1                          11.1.105-1                                amd64        CUDA-GDB
ii  cuda-libraries-11-1                    11.1.1-1                                  amd64        CUDA Libraries 11.1 meta-package
ii  cuda-libraries-dev-11-1                11.1.1-1                                  amd64        CUDA Libraries 11.1 development meta-package
ii  cuda-memcheck-11-1                     11.1.105-1                                amd64        CUDA-MEMCHECK
ii  cuda-nsight-11-1                       11.1.105-1                                amd64        CUDA nsight
ii  cuda-nsight-compute-11-1               11.1.1-1                                  amd64        NVIDIA Nsight Compute
ii  cuda-nsight-systems-11-1               11.1.1-1                                  amd64        NVIDIA Nsight Systems
ii  cuda-nvcc-11-1                         11.1.105-1                                amd64        CUDA nvcc
ii  cuda-nvdisasm-11-1                     11.1.74-1                                 amd64        CUDA disassembler
ii  cuda-nvml-dev-11-1                     11.1.74-1                                 amd64        NVML native dev links, headers
ii  cuda-nvprof-11-1                       11.1.105-1                                amd64        CUDA Profiler tools
ii  cuda-nvprune-11-1                      11.1.74-1                                 amd64        CUDA nvprune
ii  cuda-nvrtc-11-1                        11.1.105-1                                amd64        NVRTC native runtime libraries
ii  cuda-nvrtc-dev-11-1                    11.1.105-1                                amd64        NVRTC native dev links, headers
ii  cuda-nvtx-11-1                         11.1.74-1                                 amd64        NVIDIA Tools Extension
ii  cuda-nvvp-11-1                         11.1.105-1                                amd64        CUDA Profiler tools
ii  cuda-samples-11-1                      11.1.105-1                                amd64        CUDA example applications
ii  cuda-sanitizer-11-1                    11.1.105-1                                amd64        CUDA Sanitizer
ii  cuda-toolkit-11-1                      11.1.1-1                                  amd64        CUDA Toolkit 11.1 meta-package
ii  cuda-tools-11-1                        11.1.1-1                                  amd64        CUDA Tools meta-package
ii  cuda-visual-tools-11-1                 11.1.1-1                                  amd64        CUDA visual tools
>>> import torch
>>> torch.version.cuda
>>> torch.cuda.is_available()
/.local/lib/python3.6/site-packages/torch/cuda/ UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:112.)
  return torch._C._cuda_getDeviceCount() > 0
>>> print(torch.__version__)

I tried installing torch built for cu111 and for cu113; neither works. I also tried cu114, but there are no packages for it.
I have two questions:

  • Why does the command ‘nvidia-smi’ show CUDA version 11.4, while ‘dpkg -l’ shows CUDA version 11.1?
  • What do I need to do so that PyTorch sees the GPU?

Your setup seems to use different versions of the kernel-mode driver and the user-mode driver (most likely installed via the CUDA toolkit).
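One way to look for mixed driver versions is to group the NVIDIA packages from the `dpkg -l | grep nvidia` listing by their upstream version. The following is a minimal sketch, not a definitive check: the function name `nvidia_versions_by_branch` and the `SAMPLE` text are made up for illustration (the sample lines are copied from the listing in the question), and more than one version in the result only hints at a mismatch; it does not by itself prove that the driver is broken.

```python
from collections import defaultdict


def nvidia_versions_by_branch(dpkg_output):
    """Group installed NVIDIA packages from `dpkg -l` text by upstream version."""
    groups = defaultdict(list)
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Installed packages start with "ii"; the columns are
        # status, name, version, architecture, description.
        if len(fields) < 3 or fields[0] != "ii":
            continue
        name, version = fields[1], fields[2]
        if "nvidia" not in name:
            continue
        # Drop the Debian revision: "470.103.01-0ubuntu1" -> "470.103.01"
        upstream = version.split("-", 1)[0]
        # Drop the architecture qualifier: "libnvidia-cfg1-470:amd64" -> "libnvidia-cfg1-470"
        groups[upstream].append(name.split(":", 1)[0])
    return dict(groups)


# A few lines taken from the listing in the question (hypothetical variable name).
SAMPLE = """\
ii  libnvidia-compute-470:amd64            470.103.01-0ubuntu1                       amd64        NVIDIA libcompute package
ii  nvidia-dkms-470                        470.103.01-0ubuntu1                       amd64        NVIDIA DKMS package
ii  nvidia-modprobe                        510.47.03-0ubuntu1                        amd64        Load the NVIDIA kernel driver and create device files
ii  nvidia-utils-470                       470.103.01-0ubuntu1                       amd64        NVIDIA driver support binaries
"""

if __name__ == "__main__":
    for version, names in sorted(nvidia_versions_by_branch(SAMPLE).items()):
        print(version, names)
```

On a real machine the text could come from `subprocess.run(["dpkg", "-l"], capture_output=True, text=True).stdout` instead of `SAMPLE`.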

Did you “hide” the GPU, e.g. via CUDA_VISIBLE_DEVICES?
If not, was this setup ever working, or did you just install it (and perhaps forget to reboot the node)?
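To rule out a hidden GPU, the environment variable can be inspected from the same Python interpreter that runs PyTorch. This is only a rough sketch: the helper name `describe_cuda_visible_devices` is made up for illustration, and the torch part is guarded so the script still runs where torch is not installed.

```python
import os


def describe_cuda_visible_devices(env=None):
    """Explain what the current CUDA_VISIBLE_DEVICES setting means."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return "unset: all GPUs are visible to CUDA"
    if value.strip() == "":
        # An empty value hides every GPU from CUDA applications.
        return "empty: all GPUs are hidden from CUDA"
    return f"set to {value!r}: only the listed devices are visible"


if __name__ == "__main__":
    print("CUDA_VISIBLE_DEVICES is", describe_cuda_visible_devices())
    try:
        import torch
    except ImportError:
        print("torch is not installed in this interpreter")
    else:
        # torch.version.cuda is the CUDA version the wheel was built against
        # (None for CPU-only builds).
        print("torch:", torch.__version__, "built for CUDA:", torch.version.cuda)
        print("torch.cuda.is_available():", torch.cuda.is_available())
```

If the variable turns out to be set, running `unset CUDA_VISIBLE_DEVICES` in the shell before starting Python removes it for that session.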