Couldn't find CUDA library root

Need help finding what's actually causing the CMake failure; the build fails with this message despite CMake finding the CUDA root and correctly populating the CMake cache with the root, toolkit root, and associated libraries. The CMake error log ends with failed Ninja tests for alternatives immediately prior to this (testing AVX-512F support for the fbgemm build), with no indication there or in the output log of what is actually failing. This is not a repeat of #23066 or similar: the CMake run really is locating the cuda/cudart/cudnn/etc. root and toolkit root correctly, per the CMake cache.

CUDA 11.4.2-1 el7 x86_64 from the NVIDIA repo, with all the usual -nn, -rt, -nr, parser, solver, blas, lapack, etc. gubbins installed. GCC 11.2.0 and Python 3.10.0 built from source in /opt/soft/, with the resulting pip3 used to install all prerequisites into /opt/soft[/lib/python]. The suggestion to move CUDA to a raw build in /usr/local is not permitted here: CUDA must be installed via the NVIDIA repo packages, key-signed gpg-pubkey-f90c0e97-483e8383, installing into /usr/include/cuda and /lib[64].

Build procedure is:

umask 022
export PATH="/opt/soft/bin:$PATH"
export PYTHONPATH=/opt/soft/lib/python:$PYTHONPATH
export LD_LIBRARY_PATH="/opt/soft/libexec/gcc/x86_64-pc-linux-gnu/11.2.0:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/opt/soft/lib/gcc/x86_64-pc-linux-gnu/11.2.0/plugin:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/opt/soft/lib/gcc/x86_64-pc-linux-gnu/11.2.0:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/opt/soft/lib64:$LD_LIBRARY_PATH"
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
export CMAKE_PREFIX_PATH=/opt/soft
export ROCM=0
export CUDA_HOME=/
python3 setup.py build
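As a sanity check before kicking off the build, a small preflight sketch can confirm which toolchain pieces are actually on PATH; `check_tool` is just an illustrative helper, not part of the PyTorch build:

```shell
#!/bin/sh
# Preflight sketch: confirm the tools the build will pick up from PATH.
# check_tool is an illustrative helper, not part of the PyTorch build.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok $1"
  else
    echo "missing $1"
  fi
}
check_tool gcc       # expect the /opt/soft 11.2.0 build
check_tool python3   # expect the /opt/soft 3.10.0 build
check_tool nvcc      # may be absent with the packaged CUDA layout
```

With the packaged install putting nvcc outside any PATH entry, a "missing nvcc" here wouldn't itself be an error, but it narrows down what CMake can and can't see.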


CMake cache:

Output log:

Error log:

Yes those are pastebins, “new users can only…”, etc.

Tried a few other things, found this in CMakeDetermineCUDACompiler.cmake:

# CMAKE_CUDA_COMPILER_LIBRARY_ROOT contains the device library.
if(EXISTS "${CMAKE_CUDA_COMPILER_TOOLKIT_ROOT}/nvvm/libdevice")
  set(CMAKE_CUDA_COMPILER_LIBRARY_ROOT "${CMAKE_CUDA_COMPILER_TOOLKIT_ROOT}")
elseif(CMAKE_SYSROOT_LINK AND EXISTS "${CMAKE_SYSROOT_LINK}/usr/lib/cuda/nvvm/libdevice")
  set(CMAKE_CUDA_COMPILER_LIBRARY_ROOT "${CMAKE_SYSROOT_LINK}/usr/lib/cuda")
elseif(EXISTS "${CMAKE_SYSROOT}/usr/lib/cuda/nvvm/libdevice")
  set(CMAKE_CUDA_COMPILER_LIBRARY_ROOT "${CMAKE_SYSROOT}/usr/lib/cuda")
else()
  message(FATAL_ERROR "Couldn't find CUDA library root.")
endif()

So evidently it's hard-coded to look only in lib/, not to also check lib64/. If you install the CUDA toolkit(s) via the canonical NVIDIA packages on an x86_64 box without multilib or compat packages for one of the 32-bit variants, don't the libs go exclusively into /usr/lib64?
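To see what that probe would find on a given box, the same nvvm/libdevice check can be replayed by hand. A minimal sketch; the candidate roots passed in are illustrative, including a hypothetical lib64 variant the CMake module never tries:

```shell
#!/bin/sh
# Replay CMake's library-root probe by hand: the first candidate root
# containing nvvm/libdevice wins. The probe logic mirrors the module;
# the candidate list is whatever you pass in, not CMake's own.
find_cuda_library_root() {
  for root in "$@"; do
    if [ -d "$root/nvvm/libdevice" ]; then
      echo "$root"
      return 0
    fi
  done
  echo "none"
  return 1
}
```

Usage, e.g.: `find_cuda_library_root /usr/lib/cuda /usr/lib64/cuda /usr/local/cuda` — if this prints "none" for every root the packages actually populated, that matches the FATAL_ERROR above.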

I don't think that's the case, as my setup also doesn't have the lib folder.
Changing the location of the CUDA toolkit works fine, too:

ls /usr/local/cuda
NsightSystems-cli-2021.3.2  bin  compat  compute-sanitizer  extras  include  lib64  nvml  nvvm  share  src  targets

ln -s /usr/local/cuda ./mylocalcuda

CUDA_HOME=/workspace/src/mylocalcuda python setup.py develop
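The symlink trick generalizes: a split, package-managed layout (headers in /usr/include/cuda, libs in /lib64) can be presented to the build as one conventional root. A minimal sketch, assuming the split layout described earlier in the thread; the nvvm location in particular is an assumption to verify against your packages:

```shell
#!/bin/sh
# Sketch: fake a conventional single-root CUDA layout on top of a
# split package-managed install. Source paths follow this thread's
# described layout and are assumptions for any other box.
root="${CUDA_FAKE_ROOT:-$HOME/cudaroot}"
mkdir -p "$root"
ln -sfn /usr/include/cuda "$root/include"
ln -sfn /usr/lib64        "$root/lib64"
ln -sfn /usr/lib64/nvvm   "$root/nvvm"   # assumed nvvm location; verify
export CUDA_HOME="$root"
echo "CUDA_HOME=$CUDA_HOME"
```

Pointing CUDA_HOME at such a shim keeps the "install only from the signed repo packages" policy intact while giving CMake the layout it expects.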

Output from the install log:

-- Caffe2: CUDA detected: 11.5
-- Caffe2: CUDA nvcc is: /workspace/src/mylocalcuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /workspace/src/mylocalcuda
--     CUDA version        : 11.5
--     CUDA root directory : /workspace/src/mylocalcuda
--     CUDA library        : /workspace/src/mylocalcuda/lib64/stubs/
--     cudart library      : /workspace/src/mylocalcuda/lib64/
--     cublas library      : /workspace/src/mylocalcuda/lib64/
--     cufft library       : /workspace/src/mylocalcuda/lib64/
--     curand library      : /workspace/src/mylocalcuda/lib64/
--     nvrtc               : /workspace/src/mylocalcuda/lib64/
--     CUDA include path   : /workspace/src/mylocalcuda/include
--     NVCC executable     : /workspace/src/mylocalcuda/bin/nvcc

If I hardcode the

set(CMAKE_CUDA_COMPILER_LIBRARY_ROOT "/opt/soft/cuda/11.4.2-1")

in CMakeDetermineCUDACompiler.cmake to force it past that point, just to see what breaks, it fails the nvcc "simple test program" precheck. Manually repeating that precheck succeeds. I don't get this unless the CUDA install is relocated, so… bad CUDA relocation? Both the /usr/local and /opt/soft/cuda/{version}/ CUDA installs successfully build and run their samples. The same thing happens whether both installs coexist or sit on different login-node images, so it's not an interaction between the two. The 11.4.2-1 was pre-packaged; I'd bet there's something wrong with its install despite the successful samples. Either that, or something in the combination of 11.4.2-1 and Python 3.10.0 for the PyTorch build?
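For what it's worth, rather than hard-coding values inside CMakeDetermineCUDACompiler.cmake, the compiler and toolkit can be pointed at explicitly through variables CMake does honor (CUDACXX for the compiler; CUDAToolkit_ROOT for FindCUDAToolkit, CMake 3.17+). A config sketch; the path is this thread's relocated install and is an assumption for anyone else's layout:

```shell
# Config sketch: steer CMake at the relocated toolkit without editing
# its modules. The path is the /opt/soft install from this thread.
export CUDACXX=/opt/soft/cuda/11.4.2-1/bin/nvcc
export CUDAToolkit_ROOT=/opt/soft/cuda/11.4.2-1
```

This doesn't explain the failing precheck, but it rules hand-edited CMake modules out as a variable while debugging it.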

What install method was used for your CUDA, the one in /workspace/src? Tarball, "raw" installer, package, module, something else?

What python?

I don't think Python 3.10 is supported yet; the implementation work is tracked here.

Local installer via the file, Python=3.8.

Ah, right you are. 3.8.6 looks like the popular choice. Right: a modulefile for the Python version, 3.10.0 for the other toolsets, 3.8.6 for pycuda, should do it. Thanks!