Nvidia driver & cudatoolkit installed properly but check_driver fails

Ranahanocka · August 25, 2019, 3:46pm

I have successfully installed the NVIDIA driver (via PPA) and cudatoolkit (via conda). However, I am not able to use CUDA in PyTorch, even though PyTorch itself installed successfully.

Previously, I was using PyTorch with CUDA 8.0 and wanted to upgrade. I removed and purged all CUDA packages with:

sudo apt-get --purge remove cuda
sudo apt-get autoremove
dpkg --list |grep "^rc" | cut -d " " -f 3 | xargs sudo dpkg --purge

Then I updated my NVIDIA driver to 410 via PPA (Ubuntu 16.04):

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-410

Everything worked smoothly. The output of nvidia-smi:

Fri Aug 23 22:29:48 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
| 25%   35C    P8    13W / 250W |    531MiB / 11177MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1445      G   /usr/lib/xorg/Xorg                           317MiB |
|    0      2035      G   compiz                                       101MiB |
|    0      3572      G   ...uest-channel-token=13099850080781834209   110MiB |
+-----------------------------------------------------------------------------+

The output of cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module  410.78  Sat Nov 10 22:09:04 CST 2018
GCC version:  gcc version 4.9.4 (Ubuntu 4.9.4-2ubuntu1~16.04)
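
A rough way to confirm that the loaded kernel module is new enough for the toolkit is to parse the NVRM line and compare it with the minimum driver version. This is only a sketch: `parse_nvrm_version` is a hypothetical helper, and the 410.48 minimum for CUDA 10.0 on Linux is an assumption taken from NVIDIA's release notes that should be checked for your toolkit version.

```python
import re

# Assumption: minimum Linux driver for CUDA 10.0, per NVIDIA's CUDA release
# notes. Verify this value for the toolkit version you actually install.
MIN_DRIVER_FOR_CUDA_10_0 = (410, 48)


def parse_nvrm_version(text):
    """Return the kernel module version as a tuple of ints, or None if absent."""
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


# Sample line copied from the output above; on a real system, read
# /proc/driver/nvidia/version instead.
sample = (
    "NVRM version: NVIDIA UNIX x86_64 Kernel Module  410.78  "
    "Sat Nov 10 22:09:04 CST 2018"
)
version = parse_nvrm_version(sample)
print(version, version >= MIN_DRIVER_FOR_CUDA_10_0)
```

If this check passes, the kernel module is probably not the problem, and the user-space driver libraries are the next thing to look at.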

Since I wanted conda to manage my CUDA version, I installed the cudatoolkit through conda env (python 3.6):

conda install pytorch torchvision cudatoolkit=10.0 -c pytorch

Again, everything installed without errors. When I run:

import torch
print(torch.cuda.device_count()) # --> 0
print(torch.version.cuda) # --> 10.0.130

I get the outputs shown in the comments, and using CUDA fails with the following error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rana/anaconda3/envs/py36torch12cu10/lib/python3.6/site-packages/torch/cuda/__init__.py", line 178, in _lazy_init
    _check_driver()
  File "/home/rana/anaconda3/envs/py36torch12cu10/lib/python3.6/site-packages/torch/cuda/__init__.py", line 99, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
AssertionError: 
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx

I restarted, unset environment variables that might have interfered (LD_LIBRARY_PATH), removed conda and reinstalled it, and tried CUDA 9.2, but nothing works. I am not sure what the issue could be. Any ideas?

I searched a bit, and found this pytorch thread. Since I completely removed CUDA from my system this shouldn’t be the problem, but I think somehow it may be related.

EDIT:
It isn’t surprising given my error, but following this issue, I checked:
torch._C._cuda_getDriverVersion() # -> 0
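
A driver version of 0 is consistent with the CUDA runtime not reaching the user-space driver library. One possible way to check this independently of PyTorch is to load `libcuda.so.1` directly with `ctypes` and call the driver API functions `cuInit` and `cuDriverGetVersion`. The helper name `probe_cuda_driver` is hypothetical, and the sketch assumes a Linux system where the library, if installed, is on the default loader path.

```python
import ctypes


def probe_cuda_driver(library_name="libcuda.so.1"):
    """Return the CUDA driver API version as an int, or None on failure.

    None means the library could not be loaded, or one of the driver API
    calls returned a non-zero (error) status.
    """
    try:
        libcuda = ctypes.CDLL(library_name)
    except OSError:
        # The user-space driver library is missing or cannot be loaded.
        return None
    if libcuda.cuInit(0) != 0:
        return None
    version = ctypes.c_int(0)
    if libcuda.cuDriverGetVersion(ctypes.byref(version)) != 0:
        return None
    return version.value


print(probe_cuda_driver())
```

The driver API encodes the version as 1000 * major + 10 * minor, so a driver supporting CUDA 10.0 reports 10000. A result of None while nvidia-smi works would point at a missing or mismatched libcuda package rather than at the kernel module.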

ptrblck · August 25, 2019, 9:28pm

That’s some good debugging.
Could you post the output of dpkg -l | grep -i nvidia?
Probably unrelated to this issue, but are you using a secure boot option?

Ranahanocka · August 26, 2019, 7:07am

Good call… the output of dpkg -l | grep -i nvidia is

ii  bbswitch-dkms                              0.8-3ubuntu1                                 amd64        Interface for toggling the power on NVIDIA Optimus video cards
rc  nvidia-384                                 384.90-0ubuntu0.16.04.1                      amd64        NVIDIA binary driver - version 384.90
hi  nvidia-410                                 410.78-0ubuntu0~gpu16.04.1                   amd64        NVIDIA binary driver - version 410.78
rc  nvidia-opencl-icd-384                      384.90-0ubuntu0.16.04.1                      amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                               0.8.2                                        amd64        Tools to enable NVIDIA's Prime
ii  nvidia-settings                            384.81-0ubuntu1                              amd64        Tool for configuring the NVIDIA graphics driver

Very odd. So it seems the old driver (384) is still around. What do you think is the best way to fix this?
About secure boot: I don’t think I changed the bootloader, and I am not running dual boot…

ptrblck · August 26, 2019, 10:46am

Thanks for the information.
Based on the status codes, it looks like 384.81 is still installed (at least nvidia-settings), and the removed 384.90 packages still have config files on the system. I would recommend purging all drivers and reinstalling the latest (or desired) one.
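
For readers unfamiliar with the two-letter codes in the first column of `dpkg -l`: the first letter is the desired state (i = install, h = hold, r = remove) and the second is the current state (i = installed, c = only config files remain). The sketch below, with hypothetical helper names, reads the rows posted above and lists the leftovers. It is an illustration of how to read the table, not a replacement for dpkg itself.

```python
# Rows copied from the `dpkg -l | grep -i nvidia` output posted above
# (column padding shortened).
ROWS = [
    "ii  bbswitch-dkms  0.8-3ubuntu1  amd64  Interface for toggling the power on NVIDIA Optimus video cards",
    "rc  nvidia-384  384.90-0ubuntu0.16.04.1  amd64  NVIDIA binary driver - version 384.90",
    "hi  nvidia-410  410.78-0ubuntu0~gpu16.04.1  amd64  NVIDIA binary driver - version 410.78",
    "rc  nvidia-opencl-icd-384  384.90-0ubuntu0.16.04.1  amd64  NVIDIA OpenCL ICD",
    "ii  nvidia-prime  0.8.2  amd64  Tools to enable NVIDIA's Prime",
    "ii  nvidia-settings  384.81-0ubuntu1  amd64  Tool for configuring the NVIDIA graphics driver",
]


def parse_row(row):
    """Split one dpkg -l row into (desired, status, name, version)."""
    flags, name, version = row.split()[:3]
    return flags[0], flags[1], name, version


def removed_but_configured(rows):
    """Names of packages whose current state is 'c' (config files only)."""
    return [name for _, status, name, _ in map(parse_row, rows) if status == "c"]


def installed(rows):
    """Map of package name to version for packages in state 'i' (installed)."""
    return {name: ver for _, status, name, ver in map(parse_row, rows) if status == "i"}


print("leftover config:", removed_but_configured(ROWS))
print("installed:", installed(ROWS))
```

On these rows, the leftovers are the two 384.90 packages, and the installed set mixes a 410.78 driver with a 384.81 nvidia-settings.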

Ranahanocka · August 28, 2019, 2:17pm

Yay, it works! Posting my solution:

I purged all NVIDIA packages by running:

sudo apt-get remove --purge '^nvidia-.*'

After reinstalling 410 via the PPA, the output of dpkg -l | grep -i nvidia is:

ii  bbswitch-dkms                              0.8-3ubuntu1                                 amd64        Interface for toggling the power on NVIDIA Optimus video cards
ii  libcuda1-410                               410.78-0ubuntu0~gpu16.04.1                   amd64        NVIDIA CUDA runtime library
hi  nvidia-410                                 410.78-0ubuntu0~gpu16.04.1                   amd64        NVIDIA binary driver - version 410.78
ii  nvidia-opencl-icd-410                      410.78-0ubuntu0~gpu16.04.1                   amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                               0.8.2                                        amd64        Tools to enable NVIDIA's Prime
ii  nvidia-settings                            418.56-0ubuntu0~gpu16.04.1                   amd64        Tool for configuring the NVIDIA graphics driver

It is odd that nvidia-settings is at 418, but it works anyway. I also used

sudo apt-mark hold nvidia-410

to make sure the driver won’t be upgraded by sudo apt-get upgrade.

ptrblck · August 28, 2019, 2:24pm

Good to hear it’s working!

Yangmei_Shen · October 17, 2020, 1:56pm

Thank God! That helped me.