Issue with pytorch, CUDA and nvidia-drivers

ishackigozi · July 5, 2022, 3:58pm

Hello,
We are having issues using torch in our Python environments. We have two A100 GPUs with NVIDIA driver version 470.129.06, and the installed CUDA toolkit is 11.3:
root@R940-01:~# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Mar_21_19:15:46_PDT_2021
Cuda compilation tools, release 11.3, V11.3.58
Build cuda_11.3.r11.3/compiler.29745058_0

I have installed PyTorch with these packages:
- cudatoolkit=11.3
- pytorch
- torchaudio
- torchvision

But PyTorch cannot access the GPU driver or CUDA. Please advise.

(test) i.kigozi@R940-01:~$ python
Python 3.9.0 (default, Nov 15 2020, 14:28:56)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False

ptrblck · July 5, 2022, 5:30pm

Did you (re-)install the NVIDIA drivers recently (or the CUDA toolkit) without a restart?
Based on the error I would guess PyTorch has trouble communicating with the driver, so you could check if any other CUDA sample works (or if the binary would work in a docker container). If this doesn’t help, you might need to reinstall the drivers.
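As a rough sketch of such a check with PyTorch out of the loop, one could call the driver API directly through ctypes. This assumes a Linux machine where the driver library is named libcuda.so.1. The helper name check_cuda_driver is only illustrative and not part of any library.

```python
import ctypes


def check_cuda_driver(lib_name="libcuda.so.1"):
    """Try to initialize the CUDA driver directly, without PyTorch.

    Returns an (ok, message) tuple. lib_name is the usual Linux name of the
    driver library; it is an assumption and may differ on other systems.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError as exc:
        return False, f"could not load {lib_name}: {exc}"

    # cuInit(0) must succeed before any other driver call; 0 is CUDA_SUCCESS.
    status = libcuda.cuInit(0)
    if status != 0:
        return False, f"cuInit failed with CUDA error code {status}"

    # Ask the driver how many devices it can see.
    count = ctypes.c_int(0)
    status = libcuda.cuDeviceGetCount(ctypes.byref(count))
    if status != 0:
        return False, f"cuDeviceGetCount failed with CUDA error code {status}"
    return True, f"driver initialized, {count.value} device(s) visible"


if __name__ == "__main__":
    ok, message = check_cuda_driver()
    print("OK" if ok else "FAILED", "-", message)
```

If this fails as well, the problem most likely sits between the user process and the driver rather than in the PyTorch install.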

ishackigozi · July 5, 2022, 6:17pm

Hello,
Thanks for the response. I just restarted the server and am now seeing a different issue:

Python 3.9.0 (default, Nov 15 2020, 14:28:56)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
/home/i.kigozi/.conda/envs/test/lib/python3.9/site-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at …/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
>>> exit()
(test) i.kigozi@R940-01:~$ conda list | grep torch
ffmpeg 4.3 hf484d3e_0 pytorch
magma-cuda110 2.5.2 1 pytorch
pytorch-mutex 1.0 cuda pytorch
torch 1.12.0+cu116 pypi_0 pypi
torchaudio 0.12.0+cu116 pypi_0 pypi
torchvision 0.13.0+cu116 pypi_0 pypi
(test) i.kigozi@R940-01:~$ nvidia-smi
Tue Jul 5 13:17:18 2022
+----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05 Driver Version: 510.73.05 CUDA Version: 11.6 |

The CUDA version reported by nvidia-smi is 11.6.

ishackigozi · July 5, 2022, 10:04pm

On further checking, the root user gets the expected result:
root@R940-01:~# python
Python 3.7.2 (default, Dec 29 2018, 06:19:36)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> exit()

But a non-root user cannot do the same:

(test) i.kigozi@R940-01:~$ python
Python 3.9.12 (main, Jun 1 2022, 11:38:51)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False

Is this a permission issue?

ptrblck · July 6, 2022, 3:59am

This might be a permission issue, but also note that different Python environments are used.
The root user uses Python 3.7.2 while the non-root user uses Python 3.9.12.
This could also mean that someone installed a CPU-only PyTorch version into the Python 3.9.12 environment.
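Both possibilities can be checked from inside each environment. The following is a minimal sketch: it reports which interpreter and PyTorch build are in use, and whether the current user can read and write the NVIDIA device files. The function names are illustrative, and the /dev/nvidia* pattern is an assumption about where the driver creates its device files on this Linux system.

```python
import glob
import os
import sys


def describe_torch_build():
    """Report which interpreter is running and how its PyTorch was built."""
    info = {"executable": sys.executable, "python": sys.version.split()[0]}
    try:
        import torch
    except ImportError:
        info["torch"] = None  # PyTorch is not installed in this environment
        return info
    info["torch"] = torch.__version__
    # torch.version.cuda is None for CPU-only builds.
    info["built_with_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    return info


def check_device_nodes(pattern="/dev/nvidia*"):
    """Return {path: (readable, writable)} for the current user.

    The default pattern is where the NVIDIA driver usually creates its
    device files on Linux; treat it as an assumption for your system.
    """
    return {
        path: (os.access(path, os.R_OK), os.access(path, os.W_OK))
        for path in sorted(glob.glob(pattern))
    }


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
    for path, (readable, writable) in check_device_nodes().items():
        print(f"{path}: readable={readable} writable={writable}")
```

Running it once as root and once as the non-root user, each in its own environment, should show whether the two sessions differ in the interpreter, in the PyTorch build, or in access to the device files.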

ishackigozi · July 6, 2022, 4:13am

Thank you for the response. Please advise on how I can investigate this issue.

For the non-root user, I followed the instructions in this link