RuntimeError: Found no NVIDIA driver on your system when running PyTorch on EKS

We are running an EKS 1.25 cluster in AWS and are using the k8s-device-plugin to expose the GPUs to our pods.

  • We are using one of the AWS EKS-optimized GPU AMIs, which has the NVIDIA drivers and container runtime baked into it. For reference - docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html#gpu-ami
  • The image has the PyTorch library torch 2.0.1+cu117 installed, and the PyTorch binaries ship with their own CUDA runtime (as well as other CUDA libs such as cuBLAS, cuDNN, NCCL, etc.) - a quick way to confirm this is sketched below.
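
The PyTorch binaries only expect the NVIDIA driver (libcuda plus the kernel module) to come from the host; everything else ships inside the wheel. Here is a minimal sketch to confirm what the wheel bundles (plain torch introspection; the commented values are examples, not output captured from this environment):

import torch

# Version of the installed wheel and the CUDA toolkit it was built with / bundles
print("torch:", torch.__version__)                  # e.g. 2.0.1+cu117
print("bundled CUDA runtime:", torch.version.cuda)  # e.g. 11.7

# The driver (libcuda + kernel module) is the only piece that must come from the host
print("driver usable:", torch.cuda.is_available())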

When we try to access the GPU

python -c "import torch; print(torch.zeros(1).cuda()); print(torch.cuda.is_available())"

we get an error saying:

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

Output of cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module  470.182.03  Fri Feb 24 03:29:56 UTC 2023
GCC version:  gcc version 7.3.1 20180712 (Red Hat 7.3.1-17) (GCC)

Output of nvidia-smi -a from inside the docker container

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   21C    P8    14W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Any idea why PyTorch is not able to recognize the NVIDIA driver, even though the nvidia-smi and cat /proc/driver/nvidia/version outputs show that there is one?

Here is also the output from python -m torch.utils.collect_env:

Collecting environment information...
PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.27

Python version: 3.9.16 (main, Nov  7 2023, 23:48:10)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.254-170.358.amzn2.x86_64-x86_64-with-glibc2.27
Is CUDA available: False
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 470.182.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping:            7
CPU MHz:             3099.792
BogoMIPS:            4999.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-15
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.22.2
[pip3] torch==2.0.1+cu117
[pip3] torchvision==0.15.2+cu117
[conda] Could not collect

Note that it does print the NVIDIA driver version: 470.182.03.

Note: if I try to install the NVIDIA drivers in my Docker image (i.e. apt-get -qq install -y cuda-drivers), then I get a different error:

RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.

This is when our pods are not running in privileged mode.

But once we enable privileged mode via securityContext, we are able to access the GPU and the error goes away.
We certainly do not want to enable privileged mode due to security concerns.

So why do we need to install the CUDA drivers if PyTorch can detect the NVIDIA driver on the host, and why do we need to enable privileged access to get everything working?
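
For context, here is a sketch of the kind of in-container check that can help narrow this down. It assumes the standard NVIDIA device node layout (/dev/nvidiactl, /dev/nvidia-uvm, /dev/nvidia0) and the environment variables usually injected by the NVIDIA container runtime; none of this is verified output from our pods:

import glob, os

# Device nodes the CUDA user-space driver needs inside the container.
# A missing /dev/nvidia-uvm is a common cause of "CUDA unknown error"
# in non-privileged containers (assumption, not confirmed for this setup).
for dev in ("/dev/nvidiactl", "/dev/nvidia-uvm", "/dev/nvidia0"):
    print(dev, "present" if os.path.exists(dev) else "MISSING")
print("all /dev/nvidia* nodes:", glob.glob("/dev/nvidia*"))

# Environment variables usually set by the NVIDIA container runtime / device plugin
for var in ("NVIDIA_VISIBLE_DEVICES", "NVIDIA_DRIVER_CAPABILITIES", "LD_LIBRARY_PATH"):
    print(var, "=", os.environ.get(var))

Privileged mode exposes all host devices to the container, which would be consistent with the error disappearing once it is enabled.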

This sounds like an AWS issue with their images, and I wouldn’t know how PyTorch is related to the privileged Docker environment.
Instead of a PyTorch workload you could check any other CUDA application, e.g. the CUDA samples, and I would assume you would see the same behavior.
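
If building the CUDA samples is inconvenient, a minimal PyTorch-free sketch that talks to the driver directly via ctypes could serve the same purpose (cuInit, cuDriverGetVersion, and cuDeviceGetCount are standard CUDA driver API calls; the library name libcuda.so.1 is an assumption about the container's library setup):

import ctypes

# Load the CUDA *driver* library (not the runtime bundled with the PyTorch wheel)
try:
    cuda = ctypes.CDLL("libcuda.so.1")
except OSError as e:
    raise SystemExit(f"libcuda.so.1 not found in the container: {e}")

# 0 == CUDA_SUCCESS; anything else means the driver cannot be used from here
print("cuInit:", cuda.cuInit(0))

version = ctypes.c_int()
cuda.cuDriverGetVersion(ctypes.byref(version))
print("driver API version:", version.value)  # e.g. 11040 for a CUDA 11.4 capable driver

count = ctypes.c_int()
print("cuDeviceGetCount:", cuda.cuDeviceGetCount(ctypes.byref(count)), "devices:", count.value)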

Thanks for your response. I will check with another CUDA application.
Also, I was wondering about the output of python -m torch.utils.collect_env I posted: it is able to detect the NVIDIA driver version (470.182.03), so why would we get

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

when running

python -c "import torch; print(torch.zeros(1).cuda()); print(torch.cuda.is_available())"

The error message is raised if PyTorch cannot communicate properly with the driver, so just reporting the version wouldn’t be enough to verify it can also be used.
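
For example, collect_env gathers the driver version from tooling like nvidia-smi, which only proves the version string is readable; actually using the GPU requires initializing a CUDA context through the driver. A minimal sketch of the difference (standard torch and nvidia-smi calls, nothing specific to this cluster):

import subprocess
import torch

# Reporting the version only shows nvidia-smi can read it from the kernel module...
out = subprocess.run(["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
                     capture_output=True, text=True)
print("reported driver version:", out.stdout.strip())

# ...while using the GPU requires PyTorch to create a CUDA context through libcuda.
print("is_available:", torch.cuda.is_available())  # returns False instead of raising
try:
    torch.zeros(1).cuda()  # this is the call that raises the RuntimeError above
except RuntimeError as e:
    print("CUDA init failed:", e)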