$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| 68% 72C P2 341W / 420W | 9876MiB / 24576MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 234092 C python3 9871MiB |
+-----------------------------------------------------------------------------+
$ nvcc --version
Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit
Once I run $ sudo apt install nvidia-cuda-toolkit, nvidia-smi gets removed and PyTorch can't recognize the GPU. To get nvidia-smi back, I run $ sudo apt install nvidia-utils-515-server, but then nvcc gets uninstalled. This looks like a chicken-and-egg problem.
The PyTorch binaries ship with their own CUDA runtime (as well as cuDNN, NCCL etc.) and don’t need a locally installed CUDA toolkit to execute code but only a properly installed NVIDIA driver.
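A quick way to see which of these pieces is actually present on a machine is to check which binaries are on PATH (a stdlib-only sketch: the driver package ships nvidia-smi, the full toolkit ships nvcc):

```python
import shutil

def cuda_components():
    """Report which CUDA-related binaries are on PATH (sketch).

    The NVIDIA driver package ships nvidia-smi; the full CUDA toolkit
    ships nvcc. The PyTorch binaries only need the former at runtime.
    """
    return {
        "driver_tools": shutil.which("nvidia-smi"),  # None if driver utilities are missing
        "nvcc": shutil.which("nvcc"),                # None if no full toolkit is installed
    }

print(cuda_components())
```

If "driver_tools" resolves but "nvcc" is None, that is exactly the state the prebuilt PyTorch binaries expect: GPU execution works, but compiling custom CUDA extensions does not.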
Your local CUDA toolkit (with the compiler) will be used if you build PyTorch from source or a custom CUDA extension.
Based on your described issue, I guess your CUDA toolkit installation uninstalled the NVIDIA driver as well and/or broke it, so try a new full install and make sure CUDA applications work again.
I started off by installing the new PyTorch, and torch.version.cuda shows it is using CUDA 11.7. But the nvcc command is not found. Also, when I try to install other packages that require CUDA_HOME to be set, I get an error saying: OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
How should I get around this?
Should I install the nvidia-cuda-toolkit?
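For the CUDA_HOME error specifically: build helpers commonly resolve the toolkit location with a fallback chain roughly like the one below (a generic sketch of the common pattern, not any particular library's code), so exporting CUDA_HOME once a toolkit is installed is usually enough:

```python
import os
import shutil

def find_cuda_home():
    """Locate a CUDA toolkit the way many build scripts do (sketch)."""
    # 1. Explicit environment variables take priority.
    home = os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH")
    if home:
        return home
    # 2. Otherwise infer it from nvcc on PATH,
    #    e.g. /usr/local/cuda/bin/nvcc -> /usr/local/cuda
    nvcc = shutil.which("nvcc")
    if nvcc:
        return os.path.dirname(os.path.dirname(nvcc))
    # 3. Fall back to the conventional install location.
    if os.path.isdir("/usr/local/cuda"):
        return "/usr/local/cuda"
    return None
```

Note that the prebuilt PyTorch binaries never hit this path at runtime; it only matters when something needs to compile CUDA code.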
When I downloaded the CUDA toolkit 11.7 and tried to install it, I got this message:
Existing package manager installation of the driver found. It is strongly
recommended that you remove this before continuing.
Abort
Continue
The PyTorch binaries ship with their required CUDA runtime dependencies, not a full CUDA toolkit with a compiler. If you want to build PyTorch from source or a custom CUDA extension you would need to install the full CUDA toolkit locally.
I am not building PyTorch from source, just some other repositories that require CUDA.
When I try to install the full CUDA toolkit I get this message from the installer
Existing package manager installation of the driver found. It is strongly
recommended that you remove this before continuing.
Abort
Continue
Should I uninstall the CUDA runtime shipped with PyTorch (along with PyTorch itself), install the full toolkit, and then reinstall PyTorch? Or is there a way to keep things intact and just install the missing pieces of the CUDA toolkit?
No, you don’t need to uninstall any PyTorch binaries or their dependencies. The warning is raised because of your already locally installed CUDA toolkit and driver.
Is there a way to get an nvcc installed and running for the CUDA bound to the PyTorch you guys ship (in conda here, specifically)? Some libraries I’m using rely on nvcc to check for CUDA support, I think, so that would help.
You could install the matching CUDA toolkit from the NVIDIA website or could try to use the conda package.
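For the conda route, the commands would look roughly like this (the package and channel names are assumptions here; check the NVIDIA conda channel for the exact spelling for your CUDA version):

```shell
# Install an nvcc matching torch.version.cuda (11.7 in this thread); the
# package/channel names below are assumptions -- verify before running:
#   conda install -c nvidia cuda-nvcc=11.7
# Then point build scripts at the environment that provides the toolkit:
export CUDA_HOME="${CONDA_PREFIX:-/usr/local/cuda}"
export PATH="$CUDA_HOME/bin:$PATH"
echo "CUDA_HOME=$CUDA_HOME"
```

This keeps the toolkit scoped to the conda environment, so it cannot conflict with the apt-managed driver the way the nvidia-cuda-toolkit package did above.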
Also, checking for nvcc to detect CUDA support is wrong, as these packages would really be checking for a build toolchain, not a usable runtime. Which packages have these checks?
There’s a pretty novice-developed package for performant Poisson blending called fpie. Upon further inspection, I suspect it might be doing the nvcc check at the C++ layer, since the Python seems to be just an abstraction on top, so there might be bigger problems tbh.