PyTorch source build on POWER8 with CUDA 11.5 fails

I am trying to build PyTorch 1.9.0 from source on a POWER8 machine, with CUDA 11.5 and Python 3.8 compatibility. As far as I understand, there are no binaries or build configurations for this setup, so I have been trying to find a workaround. My approach is to follow the guidelines in the From Source section of the PyTorch repository, except that I run git checkout tags/v1.9.0 before syncing and updating the submodules. Then I create a conda environment and run:

$ conda install numpy ninja pyyaml setuptools cmake cffi typing_extensions future six requests dataclasses

Note that I do not install mkl as I am building on a ppc64le architecture. I then have to separately install magma via the compass channel, which has a CUDA 11.2 compatible version. Then I export the following environment variables:

export PATH=/usr/local/cuda-11.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.5/lib64:$LD_LIBRARY_PATH
export CC=/usr/bin/gcc
export CXX=/usr/bin/g++
export USE_CUDA="True"

If I try to run the setup.py script at this point, I get multiple errors about the cub namespace and conflicts with thrust, which seem to be addressed in later commits. To work around these, I followed the changes made in this commit: I created the caffe2/utils/cub_namespace.cuh header, adjusted the include statements in the relevant caffe2 files, and adjusted cmake/Dependencies.cmake.

After updating the git submodules, I run the setup script:

export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"} 
BUILD_TEST=0 USE_SYSTEM_NCCL=1 python setup.py install

This appears to address the aforementioned issues, and the build almost completes, but around step 3080/3110 I get the following error:

FAILED: bin/torch_shm_manager 
: && /usr/bin/g++ -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow -DHAVE_VSX_CPU_DEFINITION -O3 -DNDEBUG -DNDEBUG -rdynamic    -rdynamic caffe2/torch/lib/libshm/CMakeFiles/torch_shm_manager.dir/manager.cpp.o -o bin/torch_shm_manager  -Wl,-rpath,/home/mac/pytorch/build/lib:/home/mac/miniconda3/envs/aml_env/lib:/usr/local/cuda-11.5/lib64:/usr/local/cuda-11.5/lib:  lib/libshm.so  -lrt  lib/libtorch.so  -Wl,--no-as-needed,"/home/mac/pytorch/build/lib/libtorch_cpu.so" -Wl,--as-needed  lib/libprotobuf.a  -pthread  -Wl,--no-as-needed,"/home/mac/pytorch/build/lib/libtorch_cuda.so" -Wl,--as-needed  lib/libc10_cuda.so  /usr/local/cuda-11.5/lib64/libcudart.so  /home/mac/miniconda3/envs/aml_env/lib/libnvToolsExt.so  /usr/local/cuda-11.5/lib64/libcufft.so  /usr/local/cuda-11.5/lib64/libcurand.so  /usr/local/cuda-11.5/lib64/libcublas.so  /usr/local/cuda-11.5/lib/libcudnn.so  lib/libc10.so && :
/usr/local/cuda-11.5/lib64/libcublas.so: undefined reference to `cublasLtGetStatusString@libcublasLt.so.11'
/usr/local/cuda-11.5/lib64/libcublas.so: undefined reference to `cublasLtGetStatusName@libcublasLt.so.11'
collect2: error: ld returned 1 exit status
[3084/3110] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/autograd/generated/python_torch_functions.cpp.o
[3085/3110] Building NVCC (Device) object modules/detectron/CMakeFiles/caffe2_detectron_ops_gpu.dir/caffe2_detectron_ops_gpu_generated_group_spatial_softmax_op.cu.o
[3086/3110] Building NVCC (Device) object modules/detectron/CMakeFiles/caffe2_detectron_ops_gpu.dir/caffe2_detectron_ops_gpu_generated_smooth_l1_loss_op.cu.o
[3087/3110] Building NVCC (Device) object modules/detectron/CMakeFiles/caffe2_detectron_ops_gpu.dir/caffe2_detectron_ops_gpu_generated_ps_roi_pool_op.cu.o
[3088/3110] Building NVCC (Device) object modules/detectron/CMakeFiles/caffe2_detectron_ops_gpu.dir/caffe2_detectron_ops_gpu_generated_upsample_nearest_op.cu.o
[3089/3110] Building NVCC (Device) object modules/detectron/CMakeFiles/caffe2_detectron_ops_gpu.dir/caffe2_detectron_ops_gpu_generated_sigmoid_focal_loss_op.cu.o
[3090/3110] Building NVCC (Device) object modules/detectron/CMakeFiles/caffe2_detectron_ops_gpu.dir/caffe2_detectron_ops_gpu_generated_select_smooth_l1_loss_op.cu.o
[3091/3110] Building NVCC (Device) object modules/detectron/CMakeFiles/caffe2_detectron_ops_gpu.dir/caffe2_detectron_ops_gpu_generated_sample_as_op.cu.o
[3092/3110] Building NVCC (Device) object modules/detectron/CMakeFiles/caffe2_detectron_ops_gpu.dir/caffe2_detectron_ops_gpu_generated_roi_pool_f_op.cu.o
[3093/3110] Building NVCC (Device) object modules/detectron/CMakeFiles/caffe2_detectron_ops_gpu.dir/caffe2_detectron_ops_gpu_generated_spatial_narrow_as_op.cu.o
[3094/3110] Building NVCC (Device) object modules/detectron/CMakeFiles/caffe2_detectron_ops_gpu.dir/caffe2_detectron_ops_gpu_generated_softmax_focal_loss_op.cu.o
[3095/3110] Building NVCC (Device) object modules/detectron/CMakeFiles/caffe2_detectron_ops_gpu.dir/caffe2_detectron_ops_gpu_generated_sigmoid_cross_entropy_loss_op.cu.o
[3096/3110] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/autograd/generated/python_functions.cpp.o
ninja: build stopped: subcommand failed.
Building wheel torch-1.9.0a0+gitd69c22d
-- Building version 1.9.0a0+gitd69c22d
cmake -GNinja -DBUILD_PYTHON=True -DBUILD_TEST=False -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/mac/pytorch/torch -DCMAKE_PREFIX_PATH=/home/mac/miniconda3/envs/aml_env -DNUMPY_INCLUDE_DIR=/home/mac/miniconda3/envs/aml_env/lib/python3.8/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/home/mac/miniconda3/envs/aml_env/bin/python -DPYTHON_INCLUDE_DIR=/home/mac/miniconda3/envs/aml_env/include/python3.8 -DPYTHON_LIBRARY=/home/mac/miniconda3/envs/aml_env/lib/libpython3.8.so.1.0 -DTORCH_BUILD_VERSION=1.9.0a0+gitd69c22d -DUSE_CUDA=True -DUSE_NUMPY=True -DUSE_SYSTEM_NCCL=1 /home/mac/pytorch
cmake --build . --target install --config Release -- -j 160

Here is an overview of the environment variables and the locations and versions of the relevant build tools and libraries:

-- ******** Summary ********
-- General:
--   CMake version         : 3.19.6
--   CMake command         : /home/mac/miniconda3/envs/aml_env/bin/cmake
--   System                : Linux
--   C++ compiler          : /usr/bin/g++
--   C++ compiler id       : GNU
--   C++ compiler version  : 8.5.0
--   Using ccache if found : ON
--   Found ccache          : CCACHE_PROGRAM-NOTFOUND
--   CXX flags             :  -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow
--   Build type            : Release
--   Compile definitions   : ONNX_ML=1;ONNXIFI_ENABLE_EXT=1;ONNX_NAMESPACE=onnx_torch;MAGMA_V2;HAVE_MMAP=1;_FILE_OFFSET_BITS=64;HAVE_SHM_OPEN=1;HAVE_SHM_UNLINK=1;HAVE_MALLOC_USABLE_SIZE=1;USE_EXTERNAL_MZCRC;MINIZ_DISABLE_ZIP_READER_CRC32_CHECKS
--   CMAKE_PREFIX_PATH     : /home/mac/miniconda3/envs/aml_env;/usr/local/cuda-11.5;/usr/local/cuda-11.5
--   CMAKE_INSTALL_PREFIX  : /home/mac/pytorch/torch
--   USE_GOLD_LINKER       : OFF
-- 
--   TORCH_VERSION         : 1.9.0
--   CAFFE2_VERSION        : 1.9.0
--   BUILD_CAFFE2          : ON
--   BUILD_CAFFE2_OPS      : ON
--   BUILD_CAFFE2_MOBILE   : OFF
--   BUILD_STATIC_RUNTIME_BENCHMARK: OFF
--   BUILD_TENSOREXPR_BENCHMARK: OFF
--   BUILD_BINARY          : OFF
--   BUILD_CUSTOM_PROTOBUF : ON
--     Link local protobuf : ON
--   BUILD_DOCS            : OFF
--   BUILD_PYTHON          : True
--     Python version      : 3.8.12
--     Python executable   : /home/mac/miniconda3/envs/aml_env/bin/python
--     Pythonlibs version  : 3.8.12
--     Python library      : /home/mac/miniconda3/envs/aml_env/lib/libpython3.8.so.1.0
--     Python includes     : /home/mac/miniconda3/envs/aml_env/include/python3.8
--     Python site-packages: lib/python3.8/site-packages
--   BUILD_SHARED_LIBS     : ON
--   CAFFE2_USE_MSVC_STATIC_RUNTIME     : OFF
--   BUILD_TEST            : False
--   BUILD_JNI             : OFF
--   BUILD_MOBILE_AUTOGRAD : OFF
--   BUILD_LITE_INTERPRETER: OFF
--   INTERN_BUILD_MOBILE   : 
--   USE_BLAS              : 1
--     BLAS                : open
--   USE_LAPACK            : 1
--     LAPACK              : open
--   USE_ASAN              : OFF
--   USE_CPP_CODE_COVERAGE : OFF
--   USE_CUDA              : True
--     Split CUDA          : OFF
--     CUDA static link    : OFF
--     USE_CUDNN           : ON
--     CUDA version        : 11.5
--     cuDNN version       : 8.3.1
--     CUDA root directory : /usr/local/cuda-11.5
--     CUDA library        : /usr/local/cuda-11.5/lib64/stubs/libcuda.so
--     cudart library      : /usr/local/cuda-11.5/lib64/libcudart.so
--     cublas library      : /usr/local/cuda-11.5/lib64/libcublas.so
--     cufft library       : /usr/local/cuda-11.5/lib64/libcufft.so
--     curand library      : /usr/local/cuda-11.5/lib64/libcurand.so
--     cuDNN library       : /usr/local/cuda-11.5/lib/libcudnn.so
--     nvrtc               : /home/mac/miniconda3/envs/aml_env/lib/libnvrtc.so
--     CUDA include path   : /usr/local/cuda-11.5/include
--     NVCC executable     : /usr/local/cuda-11.5/bin/nvcc
--     NVCC flags          : -Xfatbin;-compress-all;-DONNX_NAMESPACE=onnx_torch;-gencode;arch=compute_60,code=sm_60;-Xcudafe;--diag_suppress=cc_clobber_ignored,--diag_suppress=integer_sign_change,--diag_suppress=useless_using_declaration,--diag_suppress=set_but_not_used,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=implicit_return_from_non_void_function,--diag_suppress=unsigned_compare_with_zero,--diag_suppress=declared_but_not_referenced,--diag_suppress=bad_friend_decl;-std=c++14;-Xcompiler;-fPIC;--expt-relaxed-constexpr;--expt-extended-lambda -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__;-Xcompiler;-fPIC
--     CUDA host compiler  : /usr/bin/gcc
--     NVCC --device-c     : OFF
--     USE_TENSORRT        : OFF
--   USE_ROCM              : OFF
--   USE_EIGEN_FOR_BLAS    : ON
--   USE_FBGEMM            : OFF
--     USE_FAKELOWP          : OFF
--   USE_KINETO            : ON
--   USE_FFMPEG            : OFF
--   USE_GFLAGS            : OFF
--   USE_GLOG              : OFF
--   USE_LEVELDB           : OFF
--   USE_LITE_PROTO        : OFF
--   USE_LMDB              : OFF
--   USE_METAL             : OFF
--   USE_PYTORCH_METAL     : OFF
--   USE_FFTW              : OFF
--   USE_MKL               : OFF
--   USE_MKLDNN            : OFF
--   USE_NCCL              : ON
--     USE_SYSTEM_NCCL     : 1
--   USE_NNPACK            : OFF
--   USE_NUMPY             : ON
--   USE_OBSERVERS         : ON
--   USE_OPENCL            : OFF
--   USE_OPENCV            : OFF
--   USE_OPENMP            : ON
--   USE_TBB               : OFF
--   USE_VULKAN            : OFF
--   USE_PROF              : OFF
--   USE_QNNPACK           : OFF
--   USE_PYTORCH_QNNPACK   : OFF
--   USE_REDIS             : OFF
--   USE_ROCKSDB           : OFF
--   USE_ZMQ               : OFF
--   USE_DISTRIBUTED       : ON
--     USE_MPI             : OFF
--     USE_GLOO            : ON
--     USE_TENSORPIPE      : ON
--   USE_DEPLOY           : OFF
--   Public Dependencies  : Threads::Threads
--   Private Dependencies : cpuinfo;fp16;gloo;tensorpipe;aten_op_header_gen;foxi_loader;rt;fmt::fmt-header-only;kineto;gcc_s;gcc;dl

Any idea on what the problem could be? Thanks in advance!

The build seems to fail with:

/usr/local/cuda-11.5/lib64/libcublas.so: undefined reference to `cublasLtGetStatusString@libcublasLt.so.11'

Could you check the symbol via nm -gD libcublas.so (you should see that it’s undefined), make sure libcublas.so links to libcublasLt.so via ldd, and then check that the symbol is defined in libcublasLt.so?
Based on the error, I guess that libcublasLt.so is either missing or that the link to it is broken.
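
The three checks above could be scripted roughly as follows. This is only a sketch: the CUDA_LIB default is taken from the paths in the build log, and syms_of_type and check_cublas are hypothetical helper names, not part of any toolkit.

```shell
#!/bin/sh
# Assumption: the toolkit lives where the build log says it does.
CUDA_LIB="${CUDA_LIB:-/usr/local/cuda-11.5/lib64}"

# Hypothetical helper: read `nm -gD` output on stdin and print the symbols
# matching regex $2 whose type letter is $1 (U = undefined, T = defined).
syms_of_type() {
  awk -v t="$1" -v pat="$2" 'NF >= 2 && $(NF-1) == t && $NF ~ pat { print $NF }'
}

# Hypothetical wrapper running the three checks in order.
check_cublas() {
  echo "-- undefined in libcublas.so:"
  nm -gD "$CUDA_LIB/libcublas.so" | syms_of_type U 'cublasLtGetStatus'
  echo "-- libcublasLt as resolved by the loader:"
  ldd "$CUDA_LIB/libcublas.so" | grep -i 'cublasLt'
  echo "-- defined in libcublasLt.so:"
  nm -gD "$CUDA_LIB/libcublasLt.so" | syms_of_type T 'cublasLtGetStatus'
}
```

After sourcing the file, run check_cublas. If the first step lists the two cublasLtGetStatus symbols and the last step prints nothing, the libcublasLt.so being inspected probably does not match the libcublas.so next to it.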

When I run nm -gD libcublas.so I see that both cublasLtGetStatusName and cublasLtGetStatusString are undefined (U). When I run ldd libcublas.so I get the following:

	linux-vdso64.so.1 (0x00007fff80680000)
	libcublasLt.so.11 => /usr/local/cuda-11.5/lib64/libcublasLt.so.11 (0x00007fff625f0000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fff62590000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fff62560000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007fff62530000)
	libm.so.6 => /lib64/libm.so.6 (0x00007fff623f0000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fff623b0000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fff62190000)
	/lib64/ld64.so.2 (0x00007fff806a0000)

When I run nm -gD libcublasLt.so I see that both cublasLtGetStatusName and cublasLtGetStatusString are defined (T). So these are both defined in libcublasLt.so and libcublas.so seems to be linked to it, yet they are still undefined in the latter.

As a side note, when I check the pytorch/build/lib directory and run ldd on some of the libraries there, I see that some of them point to libcublas.so from my conda environment but to libcublasLt.so from my /usr/local/cuda directory. Could this be a source of the problem?
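
One way to make that observation systematic is to scan the ldd output of every built library for CUDA libraries that resolve outside the expected toolkit root. The sketch below assumes the root is /usr/local/cuda-11.5 and that the build tree is ./build/lib; foreign_cuda_libs and scan_build_libs are hypothetical helper names. Note that ldd resolves dependencies using the current environment, so LD_LIBRARY_PATH should be set as it was for the build.

```shell
#!/bin/sh
# Hypothetical helper: read `ldd` output on stdin and print each CUDA-related
# library whose resolved path does not start with the expected root ($1).
foreign_cuda_libs() {
  awk -v root="$1" '
    $2 == "=>" && $1 ~ /^lib(cublas|cudart|cufft|curand|cudnn|nvrtc|nvToolsExt)/ {
      if (index($3, root) != 1) print $1 " -> " $3
    }'
}

# Hypothetical wrapper: scan every shared library in a build directory.
# Both arguments are assumptions with defaults taken from the build log.
scan_build_libs() {
  dir="${1:-build/lib}"
  root="${2:-/usr/local/cuda-11.5}"
  for lib in "$dir"/*.so; do
    [ -e "$lib" ] || continue
    hits=$(ldd "$lib" | foreign_cuda_libs "$root")
    [ -n "$hits" ] && printf '%s:\n%s\n' "$lib" "$hits"
  done
  return 0
}
```

Any line printed by scan_build_libs names a CUDA library that was picked up from somewhere other than the system toolkit, for example from the conda environment.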

I don’t know, but it could cause issues if the miniconda libcublas.so library doesn’t link to the needed symbols. I don’t know where this lib is coming from, but I guess you are mixing up the conda cudatoolkit with a local CUDA toolkit installation, so maybe create a new, clean virtual environment and try to rebuild.
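
Before rebuilding, it may help to confirm whether the conda environment ships its own CUDA libraries. The build log above already shows libnvToolsExt.so and libnvrtc.so being taken from the aml_env environment, so this seems plausible. A minimal sketch, where list_cuda_libs is a hypothetical helper and the set of library names is an assumption:

```shell
#!/bin/sh
# Hypothetical helper: list CUDA runtime libraries found under directory $1.
list_cuda_libs() {
  find "$1" \( -name 'libcublas*.so*' -o -name 'libcudart*.so*' \
            -o -name 'libnvrtc*.so*' -o -name 'libnvToolsExt*.so*' \) \
       2>/dev/null | sort
}

# Usage (run inside the activated environment):
#   list_cuda_libs "$CONDA_PREFIX/lib"
#   conda list | grep -i cuda
```

If the first command prints anything, the environment contains CUDA libraries that CMake can find through CMAKE_PREFIX_PATH ahead of the ones in /usr/local/cuda-11.5. The second command may show which conda package installed them.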

For what it’s worth, I had this exact same linker error on x86-64 when building a C++ program against libtorch. It did seem to be related to picking up bad CUDA libraries from the conda environment: my system CUDA was 11.5 (plenty new).

I was able to resolve the issue by switching from the libtorch in the conda installation to a separately downloaded libtorch and using that in my cmake command. This is obviously not a solution for the original problem here, but I wanted to mention it in case anyone else lands on this page, like I did, after searching for the error in the context of libtorch.