Building PyTorch from source in a conda environment detects wrong CUDA

Hello everyone,

I’m trying to build PyTorch from source following the official documentation. I’m on a university’s cluster and therefore use conda to keep control over my environment. I installed magma-cuda101 and cudatoolkit=10.1. The full install command, run in an otherwise empty environment, is:

conda install -c conda-forge -c pytorch -c nvidia magma-cuda101 mkl-include mkl gcc_linux-64 cxx-compiler numpy pyyaml setuptools cmake cffi python cudatoolkit=10.1

But if I run python setup.py install, the following happens:

-- Found CUDA: /usr/local/cuda (found version "8.0") 
-- Caffe2: CUDA detected: 8.0
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
CMake Error at cmake/public/cuda.cmake:42 (message):
  PyTorch requires CUDA 9.0 and above.

So it does not find the proper CUDA version. nvidia-smi tells me CUDA 10.1 is available. However, I cannot find a corresponding folder in /usr/. Right now I’m wondering how to find the origin of the CUDA 10.1 reported by nvidia-smi and how to build PyTorch against it.
Best Regards
Scipio

Hi,

Did you set up the proper CMAKE_PREFIX_PATH before running the install command? Is the nvcc that you get when you run which nvcc the one from conda?
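Both questions can be checked from Python as well. This is only a minimal sketch: it assumes conda exports CONDA_PREFIX for the activated environment, and the helper name is_inside is made up for illustration.

```python
import os
import shutil


def is_inside(path, prefix):
    """Return True if path lies inside prefix, after resolving symlinks."""
    path = os.path.realpath(path)
    prefix = os.path.realpath(prefix)
    return os.path.commonpath([path, prefix]) == prefix


if __name__ == "__main__":
    nvcc = shutil.which("nvcc")  # same PATH lookup as `which nvcc`
    env_prefix = os.environ.get("CONDA_PREFIX")  # assumed to be set by `conda activate`
    print("CMAKE_PREFIX_PATH:", os.environ.get("CMAKE_PREFIX_PATH"))
    print("nvcc:", nvcc)
    if nvcc and env_prefix:
        print("nvcc inside env:", is_inside(nvcc, env_prefix))
```

Because symlinks are resolved first, an nvcc in the env that merely links to a system install is reported as outside the env. A wrapper script that forwards to another nvcc would still pass this check, so running nvcc --version is worth doing as well.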

The CMAKE_PREFIX_PATH is set properly. When I try to run nvcc from the installation described above, no nvcc is found. If I add nvcc_linux-64 to the environment, the last output of the installation is

Cannot determine CUDA_HOME: cuda-gdb not in PATH

Subsequently, which nvcc yields the one in /conda_env/bin/ but running it returns

/conda_env/bin/nvcc: line 2: /bin/nvcc: No such file or directory

This is no surprise as the nvcc in the env itself is nothing but a shell script pointing to the system-wide nvcc. I could set CUDA_HOME manually if I knew where to look for the proper version.
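For anyone wanting to tell a wrapper script from a real compiler binary without opening the file by hand, here is a rough sketch that looks at the first bytes of the file. The function name is hypothetical, and it only distinguishes ELF binaries from files starting with a shebang.

```python
import shutil


def describe_executable(path):
    """Classify a file by its first bytes: 'elf', 'script', or 'unknown'."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    if head == b"\x7fELF":  # magic number of a compiled Linux binary
        return "elf"
    if head[:2] == b"#!":  # shebang line, i.e. an interpreted script
        return "script"
    return "unknown"


if __name__ == "__main__":
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        print("no nvcc on PATH")
    else:
        print(nvcc, "->", describe_executable(nvcc))
```

A result of "script" for the nvcc in the env would match what is described above. The /bin/nvcc in the error message possibly comes from the wrapper expanding an unset CUDA_HOME, but that is a guess from the message alone.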

It looks like the CUDA in your env is not properly installed. It should contain a full install that is independent of the system-wide one, especially if your system-wide CUDA is not the same version as the one in conda.

Ok, I will reinstall the whole env once again and report back

Also, if you can, try the CUDA samples (if they exist on conda) or some other very simple CUDA package; that will confirm whether the CUDA install was done properly.

So reinstalling brought no change. A short search suggests that the CUDA samples are neither shipped with the cudatoolkit package nor available as a separate package in conda. I’m wondering, is the CUDA version reported by nvidia-smi just the highest one supported by the driver itself, or does it reside somewhere on the system? Or should CUDA_HOME somehow point to my environment?

Does nvidia-smi report a CUDA version? From what I remember, it only reports the driver version. CUDA is independent of that.

You can set CUDA_HOME before running the pytorch install to point to the conda install. But you should not need to do that…

According to https://stackoverflow.com/questions/53422407/different-cuda-versions-shown-by-nvcc-and-nvidia-smi, nvidia-smi reports the highest CUDA version that the installed driver supports. So the output of nvidia-smi actually has little to do with my problem. However, I’m going to go on working on this tomorrow. I hope I can provide an answer for anyone stumbling upon this thread within the next few days.
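To make the distinction concrete, here is a small sketch that pulls the two numbers out of the respective tool outputs and compares them. The sample strings are illustrative only, modeled on the versions mentioned in this thread (they are not my actual output), and the function names are made up.

```python
import re


def driver_cuda_version(smi_text):
    """Extract the 'CUDA Version: X.Y' field from nvidia-smi header text."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def toolkit_cuda_version(nvcc_text):
    """Extract 'release X.Y' from the output of `nvcc --version`."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_text)
    return (int(match.group(1)), int(match.group(2))) if match else None


# Illustrative text only, not real output from the cluster.
smi_sample = "Driver Version: (elided)   CUDA Version: 10.1"
nvcc_sample = "Cuda compilation tools, release 8.0"

driver_max = driver_cuda_version(smi_sample)  # highest version the driver supports
toolkit = toolkit_cuda_version(nvcc_sample)   # version of the toolkit nvcc belongs to
minimum = (9, 0)                              # from the CMake error in the first post

print("driver supports up to:", driver_max)
print("toolkit found:", toolkit)
print("toolkit new enough for the build:", toolkit >= minimum)
print("driver can run the toolkit:", toolkit <= driver_max)
```

The build only cares about the second number, which is why the 10.1 shown by nvidia-smi does not help while the nvcc being picked up is still the 8.0 one.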

By adding cudatoolkit-dev to the list of installed packages, I got a proper nvcc in my environment. Now I have moved on to the next problem:

File /path_to_conda/miniconda3/envs/pytorch_build/lib64/stubs/libcuda.so doesn't exist

So now I’m trying to figure out how to get this file. It looks like there’s still something wrong/missing with my cuda-installation. Is magma-cuda101 the relevant package or what am I looking for?

I found /usr/lib/x86_64-linux-gnu/libcuda.so and created a symlink in /path_to_conda/miniconda3/envs/pytorch_build/lib64/stubs/ but, as was to be expected, this just led to another message:

grep: /path_to_conda/miniconda3/envs/pytorch_build/version.txt: No such file or directory

And if I try to build pytorch I’m back at the original error, although which nvcc now yields the one within the conda environment.

You should not need to do that… There is definitely something not right here.
Have you tried setting CUDA_HOME to the CUDA install in conda, and PATH so that the nvcc found is the one from conda (the real one that was installed, not a symlink to the system one, which is CUDA 8.0)?
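A sketch of what that could look like when driving the build from Python. The helper name env_with_cuda is hypothetical, and using CONDA_PREFIX as the CUDA location is an assumption that only holds if the toolkit really lives directly in the env.

```python
import os


def env_with_cuda(base_env, cuda_home):
    """Return a copy of base_env with CUDA_HOME set and its bin/ first on PATH."""
    env = dict(base_env)
    env["CUDA_HOME"] = cuda_home
    bin_dir = os.path.join(cuda_home, "bin")
    old_path = env.get("PATH", "")
    # Prepend so this nvcc wins over /usr/local/cuda/bin/nvcc.
    env["PATH"] = bin_dir + (os.pathsep + old_path if old_path else "")
    return env


# Possible usage (not executed here; assumes CONDA_PREFIX is set):
#   import subprocess, sys
#   env = env_with_cuda(os.environ, os.environ["CONDA_PREFIX"])
#   subprocess.run([sys.executable, "setup.py", "install"], env=env, check=True)
```

If an earlier configure run already picked up /usr/local/cuda, CMake may have cached that location, so it may be necessary to remove the build directory before retrying.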

Hello again,
Sorry for taking so long, but construction workers damaged the cluster’s power supply and I couldn’t access the system for the past nine days. Now I am trying again and still encounter the problem from the first post. I am wondering what the proper value for CUDA_HOME would be. I tried /miniconda3/envs/pytorch_build/pkgs/cuda-toolkit/include/thrust/system/cuda/ and /miniconda3/envs/pytorch_build/bin/ but neither did the trick.

Your CUDA_HOME should be set so that CUDA_HOME/bin/nvcc exists and CUDA_HOME/lib64/ contains all the CUDA shared libraries.
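That rule can be turned into a quick check for any candidate directory. This is a minimal sketch, not an official tool: the function name is made up, and libcudart is used as a stand-in for "the CUDA shared libraries", which is an assumption.

```python
import glob
import os


def check_cuda_home(cuda_home):
    """Return a list of problems with a candidate CUDA_HOME (empty if none found)."""
    problems = []
    nvcc = os.path.join(cuda_home, "bin", "nvcc")
    if not os.path.isfile(nvcc):
        problems.append("missing " + nvcc)
    # Assumption: the presence of libcudart indicates the runtime libraries are there.
    pattern = os.path.join(cuda_home, "lib64", "libcudart.so*")
    if not glob.glob(pattern):
        problems.append("no match for " + pattern)
    return problems


if __name__ == "__main__":
    candidate = os.environ.get("CUDA_HOME") or os.environ.get("CONDA_PREFIX")
    if candidate:
        print(candidate, "->", check_cuda_home(candidate) or "looks plausible")
```

Note that a path ending in bin/ cannot satisfy the first condition, since nvcc would then be expected at bin/bin/nvcc. Given that which nvcc points into the env's bin/ directory, the env root itself is probably the first candidate to test.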