PyTorch finds CUDA despite nvcc not found?

PyTorch sees CUDA and runs well on the GPU, but nvcc is not found:

import torch
torch.cuda.is_available()  # True
torch.cuda.device_count()  # 1
torch.cuda.current_device()  # 0
torch.cuda.get_device_name(0) # NVIDIA GeForce RTX 3090
$ nvidia-smi     
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 68%   72C    P2   341W / 420W |   9876MiB / 24576MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    234092      C   python3                          9871MiB |
+-----------------------------------------------------------------------------+
$ nvcc --version
Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit

Once I run `sudo apt install nvidia-cuda-toolkit`, nvidia-smi gets removed and PyTorch can’t recognize the GPU. To reinstall nvidia-smi, I run `sudo apt install nvidia-utils-515-server`, but then nvcc gets uninstalled. This looks like a chicken-and-egg problem.

$ sudo updatedb; locate nvcc
/etc/nvcc.profile
~/.local/lib/python3.10/site-packages/torch/share/cmake/Caffe2/Modules_CUDA_fix/upstream/FindCUDA/run_nvcc.cmak
...

OS: Ubuntu server 22.04

The PyTorch binaries ship with their own CUDA runtime (as well as cuDNN, NCCL, etc.), so they only need a properly installed NVIDIA driver to execute code, not a locally installed CUDA toolkit.
Your local CUDA toolkit (with the compiler) is used only if you build PyTorch from source or build a custom CUDA extension.
Based on your description, I guess the CUDA toolkit installation also uninstalled and/or broke the NVIDIA driver, so try a fresh full install and make sure CUDA applications work again.
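To make the distinction above concrete, here is a minimal Python sketch that reports the three pieces separately: the driver tool (`nvidia-smi`), the CUDA runtime bundled with the PyTorch binary (`torch.version.cuda`), and the compiler (`nvcc`). The function name `describe_cuda_setup` and the dictionary keys are made up for illustration; the sketch only assumes that the tools would be on `PATH` if installed.

```python
import shutil


def describe_cuda_setup():
    """Report driver tool, PyTorch's bundled CUDA runtime, and compiler separately."""
    report = {
        # nvidia-smi is installed alongside the NVIDIA driver utilities.
        "nvidia_smi": shutil.which("nvidia-smi"),
        # nvcc is only present if a full CUDA toolkit is installed and on PATH.
        "nvcc": shutil.which("nvcc"),
        "torch_cuda_runtime": None,
        "gpu_usable": False,
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment; report only the tools.
        return report
    # CUDA version the PyTorch binary was built with (None for CPU-only builds).
    report["torch_cuda_runtime"] = torch.version.cuda
    # True only if the driver and a visible GPU are usable by PyTorch.
    report["gpu_usable"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in describe_cuda_setup().items():
        print(f"{key}: {value}")
```

On a setup like the one in the question, this would be expected to show a CUDA runtime version and a usable GPU while `nvcc` is `None`, which is a valid configuration for running (but not compiling) CUDA code.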

I started off by installing the new PyTorch, and torch.version.cuda shows that it is using CUDA 11.7. But the nvcc command is not found. Also, when I try to install other packages that require CUDA_HOME to be set, I get an error (OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root).
How should I get around this?
Should I install nvidia-cuda-toolkit?
When I downloaded the CUDA toolkit 11.7 and tried to install it, I got this message:

Existing package manager installation of the driver found. It is strongly
recommended that you remove this before continuing.
Abort
Continue

The PyTorch binaries ship with their required CUDA runtime dependencies, not a full CUDA toolkit with a compiler. If you want to build PyTorch from source or a custom CUDA extension you would need to install the full CUDA toolkit locally.
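For the `CUDA_HOME` error specifically, build scripts generally need the root directory of a full toolkit install (the directory that contains `bin/nvcc`). A minimal sketch of how such a lookup might work is below. The function name `find_cuda_home`, the lookup order, and the `/usr/local/cuda` fallback are assumptions for illustration, not the logic of any particular package.

```python
import os
import shutil
from pathlib import Path


def find_cuda_home(default="/usr/local/cuda"):
    """Return a likely CUDA toolkit root, or None if no toolkit is found."""
    # 1. An explicitly set environment variable wins.
    for var in ("CUDA_HOME", "CUDA_PATH"):
        value = os.environ.get(var)
        if value:
            return value
    # 2. Otherwise derive the root from nvcc on PATH: <root>/bin/nvcc -> <root>.
    nvcc = shutil.which("nvcc")
    if nvcc:
        return str(Path(nvcc).resolve().parent.parent)
    # 3. Fall back to a conventional prefix (an assumption; adjust as needed).
    if Path(default, "bin", "nvcc").exists():
        return default
    # No toolkit found: only the runtime bundled with PyTorch is available.
    return None


if __name__ == "__main__":
    print(find_cuda_home())
```

If this returns `None`, no compiler is installed and setting `CUDA_HOME` by hand will not help until a full toolkit is present. PyTorch's extension builder exposes its own guess as `torch.utils.cpp_extension.CUDA_HOME`, which can be printed for comparison.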

I am not building PyTorch from source, just some other repositories that require CUDA.
When I try to install the full CUDA toolkit, I get this message from the installer:

Existing package manager installation of the driver found. It is strongly
recommended that you remove this before continuing.
Abort
Continue

Should I uninstall PyTorch along with its bundled CUDA runtime, install the full toolkit, and then install PyTorch again? Or is there a way to keep things intact and install only the remaining parts of the CUDA toolkit?

No, you don’t need to uninstall any PyTorch binaries or their dependencies. The warning is raised because of your already locally installed CUDA toolkit and driver.

When I try to install using

sudo sh cuda_11.7.0_515.43.04_linux.run

I downloaded the correct versions from NVIDIA, but I get:

[INFO]: Driver installation detected by command: apt list --installed | grep -e nvidia-driver-[0-9][0-9][0-9] -e nvidia-[0-9][0-9][0-9]
[INFO]: Cleaning up window
[INFO]: Complete
[INFO]: Checking compiler version...
[INFO]: gcc location: /usr/bin/gcc

[INFO]: gcc version: gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)

[INFO]: Initializing menu
[INFO]: Setup complete
[INFO]: Components to install:
[INFO]: Driver
[INFO]: 515.43.04
[INFO]: Executing NVIDIA-Linux-x86_64-515.43.04.run --ui=none --no-questions --accept-license --disable-nouveau --no-cc-version-check --install-libglvnd  2>&1
[INFO]: Finished with code: 256
[ERROR]: Install of driver component failed.
[ERROR]: Install of 515.43.04 failed, quitting

Is there a way to get nvcc installed and running for the CUDA version bound to the PyTorch binaries you ship (in conda, specifically)? I think some libraries I’m using rely on nvcc to check for CUDA support, so that would help.

You could install the matching CUDA toolkit from the NVIDIA website or try the conda package.
Also, checking for nvcc to detect CUDA support is wrong, since that checks for a build toolchain rather than for CUDA support itself. Which packages perform this check without actually using nvcc?
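To check whether an installed toolkit actually matches the CUDA version PyTorch reports, something like the following sketch could be used. It parses the `release X.Y` token that `nvcc --version` prints and compares it with a runtime version string such as `torch.version.cuda`. The helper names are hypothetical, and the sample line in the usage note is only meant to illustrate the output format.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(version_text):
    """Extract 'MAJOR.MINOR' from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+\.\d+)", version_text)
    return match.group(1) if match else None


def installed_nvcc_release():
    """Return the release reported by nvcc on PATH, or None if nvcc is missing."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run(
        [nvcc, "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


def toolkit_matches_runtime(nvcc_release, runtime_version):
    """Compare major.minor of the toolkit with the runtime version string."""
    if nvcc_release is None or runtime_version is None:
        return False
    return nvcc_release.split(".")[:2] == runtime_version.split(".")[:2]


if __name__ == "__main__":
    # "11.7" stands in for torch.version.cuda from the posts above.
    print(toolkit_matches_runtime(installed_nvcc_release(), "11.7"))
```

For example, `parse_nvcc_release("Cuda compilation tools, release 11.7, V11.7.64")` yields `"11.7"`, which would match a PyTorch build reporting `11.7`. A missing `nvcc` makes the comparison return `False`, which says nothing about whether PyTorch itself can use the GPU.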

There’s a package called fpie for performant Poisson blending, which looks like it was developed by a fairly novice author. On further inspection, I suspect it does the nvcc check at the C++ layer, since the Python code seems to be just an abstraction on top, so there might be bigger problems, to be honest.