Need help finding what’s actually causing the CMake failure; the build fails with this message despite finding the CUDA root and correctly populating the CMake cache with the root, toolkit root, and associated libs. The CMake error log ends with failed ninja tests for alternatives immediately before this (testing avx512f support for the fbgemm build), with no indication there or in the output log of what’s actually failing. This is not a repeat of #23066 or similar: the CMake run is actually locating the cuda/cudart/cudnn/etc. root/toolkit root correctly per the CMake cache.
CUDA 11.4.2-1 el7 x86_64 from the NVIDIA repo, with all the usual -nn, -rt, -nr, parser, solver, blas, lapack, etc. gubbins installed. GCC 11.2.0 and Python 3.10.0 built from source in /opt/soft/, with the resulting pip3 used to install all prerequisites into /opt/soft[/lib/python]. The suggestion to move CUDA to a raw build in /usr/local is not permitted: CUDA must be installed via the NVIDIA repo packages, key-signed gpg-pubkey-f90c0e97-483e8383 and installing into /usr/include/cuda and /lib[64].
Tried a few other things, found this in CMakeDetermineCUDACompiler.cmake:
# CMAKE_CUDA_COMPILER_LIBRARY_ROOT contains the device library.
if(EXISTS "${CMAKE_CUDA_COMPILER_TOOLKIT_ROOT}/nvvm/libdevice")
  set(CMAKE_CUDA_COMPILER_LIBRARY_ROOT "${CMAKE_CUDA_COMPILER_TOOLKIT_ROOT}")
elseif(CMAKE_SYSROOT_LINK AND EXISTS "${CMAKE_SYSROOT_LINK}/usr/lib/cuda/nvvm/libdevice")
  set(CMAKE_CUDA_COMPILER_LIBRARY_ROOT "${CMAKE_SYSROOT_LINK}/usr/lib/cuda")
elseif(EXISTS "${CMAKE_SYSROOT}/usr/lib/cuda/nvvm/libdevice")
  set(CMAKE_CUDA_COMPILER_LIBRARY_ROOT "${CMAKE_SYSROOT}/usr/lib/cuda")
else()
  message(FATAL_ERROR "Couldn't find CUDA library root.")
endif()
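For illustration, the kind of lib64-aware fallback I'd expect would look something like this — a sketch only, with the /usr/lib64/cuda paths being my guess at the RPM layout, not upstream CMake code:

```cmake
# Hypothetical lib64-aware version of the probe above; the lib64 branches
# are additions, the lib/ branches mirror the shipped logic.
if(EXISTS "${CMAKE_CUDA_COMPILER_TOOLKIT_ROOT}/nvvm/libdevice")
  set(CMAKE_CUDA_COMPILER_LIBRARY_ROOT "${CMAKE_CUDA_COMPILER_TOOLKIT_ROOT}")
elseif(CMAKE_SYSROOT_LINK AND EXISTS "${CMAKE_SYSROOT_LINK}/usr/lib64/cuda/nvvm/libdevice")
  set(CMAKE_CUDA_COMPILER_LIBRARY_ROOT "${CMAKE_SYSROOT_LINK}/usr/lib64/cuda")
elseif(EXISTS "${CMAKE_SYSROOT}/usr/lib64/cuda/nvvm/libdevice")
  set(CMAKE_CUDA_COMPILER_LIBRARY_ROOT "${CMAKE_SYSROOT}/usr/lib64/cuda")
elseif(CMAKE_SYSROOT_LINK AND EXISTS "${CMAKE_SYSROOT_LINK}/usr/lib/cuda/nvvm/libdevice")
  set(CMAKE_CUDA_COMPILER_LIBRARY_ROOT "${CMAKE_SYSROOT_LINK}/usr/lib/cuda")
elseif(EXISTS "${CMAKE_SYSROOT}/usr/lib/cuda/nvvm/libdevice")
  set(CMAKE_CUDA_COMPILER_LIBRARY_ROOT "${CMAKE_SYSROOT}/usr/lib/cuda")
else()
  message(FATAL_ERROR "Couldn't find CUDA library root.")
endif()
```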
So evidently it’s hardcoded to look only in lib/, never in lib64/. If you install the CUDA toolkit(s) via the canonical NVIDIA packages on an x86_64 box without multilib or compat packages for one of the 32-bit variants, don’t the libs go exclusively into /usr/lib64?
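To make the miss concrete, here's the probe re-created in shell with a lib64 candidate added first, demonstrated against a scratch sysroot so it runs anywhere; the lib64 path is my guess at the RPM layout, and find_cuda_library_root is my own name, not anything CMake defines:

```shell
# Sketch of the libdevice probe with a lib64 fallback added.
find_cuda_library_root() {
  sysroot="$1"
  for cand in "$sysroot/usr/lib64/cuda" "$sysroot/usr/lib/cuda"; do
    if [ -d "$cand/nvvm/libdevice" ]; then
      printf '%s\n' "$cand"
      return 0
    fi
  done
  return 1
}

# Fake an RPM-style x86_64 layout: libdevice under lib64 only.
root="$(mktemp -d)"
mkdir -p "$root/usr/lib64/cuda/nvvm/libdevice"
find_cuda_library_root "$root"    # succeeds; the lib/-only probe would not
```

With only the shipped lib/ candidate, the same layout would fall through to the FATAL_ERROR branch.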
Hacked CMakeDetermineCUDACompiler.cmake to force it past that point just to see what breaks next; it then fails the nvcc “simple test program” precheck. Manually repeating that precheck succeeds. I don’t see this behaviour unless the CUDA install is relocated, so… bad CUDA relocation? Both the /usr/local and /opt/soft/cuda/{version}/ CUDA installs successfully build and run their samples. Same thing whether both installs coexist or sit on different login node images, so it’s not an interaction between them. The 11.4.2-1 was prepackaged; I’d bet there’s something wrong with its install despite the successful samples. Either that, or something with the combination of 11.4.2-1 and Python 3.10.0 for the PyTorch build?
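For reference, manually repeating the precheck amounts to compiling and running something like the following — a minimal stand-in, not CMake's exact generated source (CMake's own probe just compiles a trivial main; the kernel launch here is extra, to exercise the runtime as well):

```cuda
// check.cu -- minimal stand-in for the nvcc "simple test program" precheck.
// Build and run with, e.g.:  nvcc check.cu -o check && ./check
#include <cstdio>

__global__ void noop() {}

int main() {
  noop<<<1, 1>>>();
  cudaError_t err = cudaDeviceSynchronize();
  std::printf("%s\n", cudaGetErrorString(err));
  return err == cudaSuccess ? 0 : 1;
}
```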
What install method did you use for your CUDA, the one in /workspace/src? Tarball, “raw” installer, package, module, other?
Ah, right you are. 3.8.6 looks like the popular choice. Right, a modulefile for the Python version: 3.10.0 for the other toolsets, 3.8.6 for pycuda. That should do it. Thanks!
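In case it helps anyone landing here, a modulefile for pinning the pycuda Python could look roughly like this — the /opt/soft/python/3.8.6 paths are assumptions based on the layout described above, not a known install:

```tcl
#%Module1.0
## Hypothetical python/3.8.6 modulefile for the pycuda build;
## adjust the prefix to wherever the source build actually landed.
conflict python
set prefix /opt/soft/python/3.8.6
prepend-path PATH            $prefix/bin
prepend-path LD_LIBRARY_PATH $prefix/lib
prepend-path MANPATH         $prefix/share/man
```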