This seems to be a common issue, but I really can't root-cause this one.
I'm trying to set up PyTorch + CUDA on an AWS p3 instance (NVIDIA Tesla V100 GPUs).
Torch output:
torch.__version__ # 2.0.1+cu117
torch.cuda.device_count() # --> 0
torch.cuda.is_available() # --> False
torch.version.cuda # --> 11.7
torch.backends.cudnn.version() # 8500
torch.zeros(1).cuda() # "RuntimeError: Found no NVIDIA driver on your system"
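In case it's useful, a minimal sanity-check sketch (assumption on my part: an environment variable such as CUDA_VISIBLE_DEVICES could be hiding the GPUs from PyTorch even though nvidia-smi sees them):

import os
import torch

# Environment variables that can affect GPU visibility inside the venv.
for var in ("CUDA_VISIBLE_DEVICES", "NVIDIA_VISIBLE_DEVICES", "LD_LIBRARY_PATH", "CUDA_HOME"):
    print(var, "=", os.environ.get(var))

print("torch:", torch.__version__, "built for CUDA", torch.version.cuda)
print("is_available:", torch.cuda.is_available())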
pip list
output:
Package Version
------------------------ ----------
cmake 3.26.3
filelock 3.12.0
Jinja2 3.1.2
lit 16.0.5
MarkupSafe 2.1.2
mpmath 1.3.0
networkx 3.1
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
pip 22.3.1
setuptools 65.6.3
sympy 1.12
torch 2.0.1
triton 2.0.0
typing_extensions 4.6.2
wheel 0.40.0
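Since the CUDA runtime libraries above come from the nvidia-*-cu11 pip wheels, here is a small sketch to list where they ended up and compare that with LD_LIBRARY_PATH (assuming, as I understand it, that the wheels unpack their shared libraries under site-packages/nvidia/<package>/lib):

import os
import sysconfig

# Print the lib directories shipped by the nvidia-* wheels in this venv.
site_packages = sysconfig.get_paths()["purelib"]
nvidia_dir = os.path.join(site_packages, "nvidia")
if os.path.isdir(nvidia_dir):
    for pkg in sorted(os.listdir(nvidia_dir)):
        lib_dir = os.path.join(nvidia_dir, pkg, "lib")
        if os.path.isdir(lib_dir):
            print(lib_dir)
print("LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH"))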
nvidia-smi output
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01 Driver Version: 515.105.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:17.0 Off | 0 |
| N/A 33C P0 57W / 300W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:00:18.0 Off | 0 |
| N/A 32C P0 56W / 300W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:00:19.0 Off | 0 |
| N/A 33C P0 56W / 300W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:00:1A.0 Off | 0 |
| N/A 34C P0 55W / 300W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... Off | 00000000:00:1B.0 Off | 0 |
| N/A 33C P0 55W / 300W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... Off | 00000000:00:1C.0 Off | 0 |
| N/A 32C P0 56W / 300W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... Off | 00000000:00:1D.0 Off | 0 |
| N/A 32C P0 58W / 300W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... Off | 00000000:00:1E.0 Off | 0 |
| N/A 33C P0 55W / 300W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
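For reference, the same driver information can also be queried from inside the virtual environment (a sketch, assuming nvidia-smi is on the PATH of the Python process), to rule out the venv seeing a different environment than my interactive shell:

import subprocess

# Ask nvidia-smi for a machine-readable summary from within Python.
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,name,driver_version", "--format=csv"],
    capture_output=True, text=True,
)
print(result.stdout or result.stderr)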
nvcc --version
output
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
Also, I’ve downloaded the NVIDIA samples using git clone https://github.com/NVIDIA/cuda-samples.git --branch v11.6, then built and ran them using make, and they all seem to run fine…
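Since the compiled samples work, a rough equivalent of what deviceQuery does can be reproduced from Python via the CUDA driver API (a sketch using ctypes; cuInit and cuDeviceGetCount are standard driver-API calls):

import ctypes

# Load the driver API library installed by the NVIDIA driver (not the toolkit).
cuda = ctypes.CDLL("libcuda.so.1")

count = ctypes.c_int()
# Both calls return 0 (CUDA_SUCCESS) when the driver is reachable.
print("cuInit:", cuda.cuInit(0))
print("cuDeviceGetCount:", cuda.cuDeviceGetCount(ctypes.byref(count)))
print("devices seen by the driver API:", count.value)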
Thanks.
Edit:
- ChatGPT suggested I check the permissions, so here is the output of ls -l /dev/nvidia*, although I don’t find anything specific (a quick programmatic check of the same is sketched at the end of this post):
crw-rw-rw- 1 root root 195, 0 Jun 1 02:56 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 Jun 1 02:56 /dev/nvidia1
crw-rw-rw- 1 root root 195, 2 Jun 1 02:56 /dev/nvidia2
crw-rw-rw- 1 root root 195, 3 Jun 1 02:56 /dev/nvidia3
crw-rw-rw- 1 root root 195, 4 Jun 1 02:56 /dev/nvidia4
crw-rw-rw- 1 root root 195, 5 Jun 1 02:56 /dev/nvidia5
crw-rw-rw- 1 root root 195, 6 Jun 1 02:56 /dev/nvidia6
crw-rw-rw- 1 root root 195, 7 Jun 1 02:56 /dev/nvidia7
crw-rw-rw- 1 root root 195, 255 Jun 1 02:56 /dev/nvidiactl
/dev/nvidia-caps:
total 0
cr-------- 1 root root 248, 1 Jun 1 02:56 nvidia-cap1
cr--r--r-- 1 root root 248, 2 Jun 1 02:56 nvidia-cap2
- I also checked the CUDA_HOME variable: echo $CUDA_HOME now returns /usr/local/cuda-11.7 (a similar path to which nvcc, which returns /usr/local/cuda-11.7/bin/nvcc).
- If that’s relevant, I’m running Python 3.9 in a virtual environment created specifically for this.
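For the permissions point above, a minimal sketch to confirm that the venv’s Python process itself can open the device nodes listed there:

import glob
import os

# Check read/write access to the NVIDIA device nodes from this process.
for dev in sorted(glob.glob("/dev/nvidia*")):
    print(dev, "read:", os.access(dev, os.R_OK), "write:", os.access(dev, os.W_OK))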