Building PyTorch from source in a conda environment detects the wrong CUDA

Scipio · May 11, 2020, 2:06pm

Hello everyone,

I’m trying to build PyTorch from source following the official documentation. I’m on a university cluster and therefore use conda to keep control over my environment. I installed magma-cuda101 and cudatoolkit=10.1. The full install command, run in an otherwise empty environment, is:

conda install -c conda-forge -c pytorch -c nvidia magma-cuda101 mkl-include mkl gcc_linux-64 cxx-compiler numpy pyyaml setuptools cmake cffi python cudatoolkit=10.1

But when I run python setup.py install, the following happens:

-- Found CUDA: /usr/local/cuda (found version "8.0") 
-- Caffe2: CUDA detected: 8.0
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
CMake Error at cmake/public/cuda.cmake:42 (message):
  PyTorch requires CUDA 9.0 and above.

So it does not find the right CUDA version. nvidia-smi tells me CUDA 10.1 is available, but I cannot find a corresponding folder in /usr/. Right now I’m wondering how to find the origin of the CUDA 10.1 reported by nvidia-smi and how to build PyTorch against it.
Best Regards
Scipio

albanD · May 11, 2020, 2:30pm

Hi,

Did you set up the proper CMAKE_PREFIX_PATH before running the install command? Is the nvcc that you get from which nvcc the one from conda?
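
One way to answer the second question is sketched below in Python. The helper name nvcc_is_from_env is hypothetical, and the script assumes the active environment's root is exported as CONDA_PREFIX, which conda sets on activation.

```python
import os
import shutil


def nvcc_is_from_env(nvcc_path, env_prefix):
    """Return True if nvcc_path resolves to a file inside env_prefix."""
    if not nvcc_path or not env_prefix:
        return False
    # realpath follows symlinks, so a link to a system-wide nvcc is rejected.
    real_nvcc = os.path.realpath(nvcc_path)
    real_prefix = os.path.realpath(env_prefix)
    return os.path.commonpath([real_nvcc, real_prefix]) == real_prefix


if __name__ == "__main__":
    nvcc = shutil.which("nvcc")
    prefix = os.environ.get("CONDA_PREFIX")
    print("nvcc on PATH:", nvcc)
    print("inside active conda env:", nvcc_is_from_env(nvcc, prefix))
```

This catches a symlink that points outside the environment. It does not catch a wrapper script that lives in the environment and forwards to another nvcc, so running the resulting nvcc once is still worthwhile.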

Scipio · May 11, 2020, 3:14pm

The CMAKE_PREFIX_PATH is set properly. When I try to run nvcc from the installation described above, no nvcc is found. If I add nvcc_linux-64 to the environment, the last line of the installation output is:

Cannot determine CUDA_HOME: cuda-gdb not in PATH

Subsequently, which nvcc yields the one in /conda_env/bin/, but running it returns:

/conda_env/bin/nvcc: line 2: /bin/nvcc: No such file or directory

This is no surprise as the nvcc in the env itself is nothing but a shell script pointing to the system-wide nvcc. I could set CUDA_HOME manually if I knew where to look for the proper version.

albanD · May 11, 2020, 3:20pm

It looks like the CUDA in your env is not properly installed. It should contain a full install that is independent of the system-wide one, especially if your system-wide CUDA is not the same version as the one in conda.

Scipio · May 11, 2020, 3:22pm

OK, I will reinstall the whole env once again and report back.

albanD · May 11, 2020, 3:25pm

Also, if you can, try the CUDA samples (if they exist on conda) or some other very simple CUDA package. That will confirm whether the CUDA install was done properly.

Scipio · May 11, 2020, 4:00pm

So reinstalling brought no change. A short search suggests that the CUDA samples are neither shipped with the CUDA toolkit nor available as a package in conda. I’m wondering: is the CUDA version reported by nvidia-smi just the highest one supported by the driver itself, or does it reside somewhere on the system? Or should CUDA_HOME somehow point to my environment?

albanD · May 11, 2020, 4:14pm

Does nvidia-smi report a CUDA version? From what I remember, it only reports the driver version. CUDA is independent of that.

You can set CUDA_HOME before running the pytorch install to point to the conda install. But you should not need to do that…

Scipio · May 11, 2020, 8:15pm

According to https://stackoverflow.com/questions/53422407/different-cuda-versions-shown-by-nvcc-and-nvidia-smi, nvidia-smi reports the highest CUDA version that can be used with the installed driver. So the output of nvidia-smi actually has little to do with my problem. I’m going to continue working on this tomorrow, and I hope I can provide an answer for anyone stumbling upon this thread within the next few days.
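
The toolkit version that matters for the build therefore has to come from nvcc itself. A minimal Python sketch, assuming nvcc --version prints a line of the form "Cuda compilation tools, release 10.1, V10.1.243"; the function name parse_nvcc_release is made up for this example.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(version_output):
    """Extract (major, minor) from the 'release X.Y' part of `nvcc --version`."""
    match = re.search(r"release\s+(\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        print("no nvcc on PATH")
    else:
        # A broken wrapper script prints nothing useful, which yields None.
        result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
        print("toolkit release:", parse_nvcc_release(result.stdout))
```

Going by the CMake error in the first post, the build needs this to be at least (9, 0).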

Scipio · May 12, 2020, 8:11am

By adding cudatoolkit-dev to the list of installed packages, I got a proper nvcc in my environment. Now I have moved on to the next problem:

File /path_to_conda/miniconda3/envs/pytorch_build/lib64/stubs/libcuda.so doesn't exist

So now I’m trying to figure out how to get this file. It looks like something is still wrong or missing in my CUDA installation. Is magma-cuda101 the relevant package, or what should I be looking for?

Scipio · May 12, 2020, 11:29am

I found /usr/lib/x86_64-linux-gnu/libcuda.so and created a symlink to it in /path_to_conda/miniconda3/envs/pytorch_build/lib64/stubs/, but as expected this just led to another message:

grep: /path_to_conda/miniconda3/envs/pytorch_build/version.txt: No such file or directory

And if I try to build PyTorch, I’m back at the original error, although which nvcc now yields the one within the conda environment.

albanD · May 12, 2020, 4:41pm

You should not need to do that… There is definitely something not right here.
Have you tried setting CUDA_HOME to the CUDA install in conda, and PATH so that the nvcc found is the one from conda? (The real one that was installed, not a symlink to the system one, which is CUDA 8.0.)

Scipio · May 22, 2020, 8:13pm

Hello again,
Sorry for taking so long, but construction workers damaged the cluster’s power supply and I couldn’t access the system for the past nine days. Now I am trying again and still encounter the problem from the first post. I am wondering what the proper value for CUDA_HOME would be. I tried /miniconda3/envs/pytorch_build/pkgs/cuda-toolkit/include/thrust/system/cuda/ and /miniconda3/envs/pytorch_build/bin/, but neither did the trick.

albanD · May 26, 2020, 12:52am

Your CUDA_HOME should be such that CUDA_HOME/bin/nvcc can be found and CUDA_HOME/lib64/* contains all the cuda shared libraries.
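
That rule can be turned into a quick check before starting a build. The sketch below is only an illustration: check_cuda_home is a hypothetical helper, and it uses libcudart.so* as a stand-in for "all the cuda shared libraries", which is an assumption rather than a complete test.

```python
import glob
import os


def check_cuda_home(cuda_home):
    """Return a list of problems found with a candidate CUDA_HOME (empty if none)."""
    problems = []
    nvcc = os.path.join(cuda_home, "bin", "nvcc")
    if not os.path.isfile(nvcc):
        problems.append("missing " + nvcc)
    # Assumption: the CUDA runtime library is a reasonable proxy for the rest;
    # a full toolkit ships many more shared libraries than this one.
    pattern = os.path.join(cuda_home, "lib64", "libcudart.so*")
    if not glob.glob(pattern):
        problems.append("no match for " + pattern)
    return problems


if __name__ == "__main__":
    candidate = os.environ.get("CUDA_HOME", "")
    if not candidate:
        print("CUDA_HOME is not set")
    else:
        for line in check_cuda_home(candidate) or ["layout looks plausible"]:
            print(line)
```

Both paths tried above would fail this check: CUDA_HOME has to be the directory that contains bin/ and lib64/, not bin/ itself or a subdirectory of include/.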

Sanmi_Adeleye · October 31, 2020, 5:40am

This worked for me, posting for future reference.

export CUDA_HOME=/usr/local/cuda

Found here