The image has the PyTorch library installed (torch 2.0.1+cu117), and the PyTorch binaries ship with their own CUDA runtime (as well as other CUDA libs such as cuBLAS, cuDNN, NCCL, etc.).
We get an error saying: RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
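For reference, a minimal check along these lines (just a sketch, not our full workload) is enough to hit the error inside the container:

import torch

# The CUDA runtime bundled with the wheel is reported regardless of the driver state.
print(torch.version.cuda)          # e.g. 11.7

# Returns False (with a warning) when PyTorch cannot initialize the driver.
print(torch.cuda.is_available())

# Any actual GPU work then raises the "Found no NVIDIA driver on your system" RuntimeError.
x = torch.randn(1).cuda()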
Output of cat /proc/driver/nvidia/version:
NVRM version: NVIDIA UNIX x86_64 Kernel Module 470.182.03 Fri Feb 24 03:29:56 UTC 2023
GCC version: gcc version 7.3.1 20180712 (Red Hat 7.3.1-17) (GCC)
Output of nvidia-smi -a from inside the Docker container:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   21C    P8    14W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
Any idea why PyTorch is not able to recognize the NVIDIA driver, even though the nvidia-smi and cat /proc/driver/nvidia/version outputs show that there is one?
Note: if I install the NVIDIA drivers in my Docker image, i.e. apt-get -qq install -y cuda-drivers, then I get a different error saying:
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
This is when our pods are not running in privileged mode.
But once we enable privileged mode via securityContext, we are able to access the GPU and the error goes away.
We certainly do not want to enable privileged mode due to security concerns.
So why do we need to install the CUDA drivers if PyTorch can detect the NVIDIA driver on the host, and why do we need to enable privileged access to get everything working?
This sounds like an AWS issue with their images, and I wouldn't know how PyTorch is related to the privileged Docker environment.
Instead of a PyTorch workload you could check any other CUDA application, e.g. the CUDA samples, and I would expect to see the same behavior.
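If you don't have the CUDA samples at hand, a quick sanity check that bypasses PyTorch entirely would be to call the CUDA driver API (libcuda.so.1, which comes from the host driver) directly via ctypes (just a sketch):

import ctypes

# libcuda.so.1 is provided by the host NVIDIA driver, not by the PyTorch wheel.
libcuda = ctypes.CDLL("libcuda.so.1")

# cuInit returns 0 (CUDA_SUCCESS) if the driver can be initialized from this container.
print("cuInit:", libcuda.cuInit(0))

# Ask the driver how many devices it exposes here.
count = ctypes.c_int()
print("cuDeviceGetCount:", libcuda.cuDeviceGetCount(ctypes.byref(count)), "devices:", count.value)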
Thanks for your response. I will check with another CUDA application.
Also, I was wondering about the output of python -m torch.utils.collect_env I posted. It is able to detect the NVIDIA driver version (470.182.03), so why would we get:
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
The error message is raised if PyTorch cannot communicate properly with the driver, so just reporting the version wouldn’t be enough to verify it can also be used.
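Roughly speaking (as an illustration, not what collect_env does internally), reporting the version only needs to read information the kernel module exposes, while using the driver requires creating a CUDA context, and that is the step that fails here:

import torch

# Reporting: the driver version can be read without touching the device at all.
with open("/proc/driver/nvidia/version") as f:
    print(f.readline().strip())

# Using: this forces CUDA initialization and raises the RuntimeError you saw
# if PyTorch cannot communicate with the driver.
torch.cuda.init()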