Build fails on OS X

Pekka · November 9, 2017, 10:13am

I’m trying to build PyTorch, but it fails consistently at the same stage. CUDA 8.0.90 and cuDNN 6.0 are installed. The Clang version is below:

Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin15.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

Environment for build:
declare -x CMAKE_PREFIX_PATH="/adaconda2/bin/conda"
declare -x CUDA_HOME="/Developer/NVIDIA/CUDA-8.0"
declare -x CUDNN_INCLUDE_DIR="/Developer/NVIDIA/CUDA-8.0/include"
declare -x CUDNN_LIB_DIR="/Developer/NVIDIA/CUDA-8.0/lib"
declare -x DYLD_LIBRARY_PATH="/Developer/NVIDIA/CUDA-8.0/lib"

Build command:
MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py install

Build error message:
-clip-
clang -bundle -undefined dynamic_lookup -L/anaconda2/lib -arch x86_64 -arch x86_64 build/temp.macosx-10.6-x86_64-2.7/torch/csrc/nvrtc.o -L/Users/pekka/src/pytorch/torch/lib -L/Developer/NVIDIA/CUDA-8.0/lib -L/Developer/NVIDIA/CUDA-8.0/lib -L/Developer/NVIDIA/CUDA-8.0/lib/stubs -L/anaconda2/lib -o build/lib.macosx-10.6-x86_64-2.7/torch/_nvrtc.so -Wl,-rpath,/Developer/NVIDIA/CUDA-8.0/lib -Wl,-rpath,/Developer/NVIDIA/CUDA-8.0/lib -Wl,-rpath,@loader_path/lib -lcuda -lnvrtc
ld: library not found for -lcuda
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'clang' failed with exit status 1

I’m running OS X 10.11.6 on a mid-2014 MacBook Pro with an NVIDIA 650M GPU.

Pekka · November 13, 2017, 2:38pm

Obviously nobody uses the OS X version of PyTorch?

Pekka · November 15, 2017, 12:47pm

I had another try with CUDA 9.0 and cuDNN 7.0. Environment variables:
CMAKE_PREFIX_PATH="/usr/local/miniconda3"
CUDA_HOME="/Developer/NVIDIA/CUDA-9.0"

and build command:
MACOSX_DEPLOYMENT_TARGET=10.9 WITH_CUDA=1 WITH_CUDNN=1 CC=clang CXX=clang++ python setup.py install > build.log

It fails again with exactly the same bloody linker error: “-lcuda” not found. Either the installation script sucks or the installation instructions are bloody outdated. So far I’m not very impressed with PyTorch or with the support on this forum.

Complete build log: https://pastebin.com/6MhuCY0w

smth · November 15, 2017, 3:01pm

OS X with CUDA is quite hard to support: we’ve had very few people using it, and compatibility between Xcode versions and CUDA keeps changing.

libcuda.dylib (the library the linker is looking for when it reports -lcuda as not found) comes with the NVIDIA driver. Do you have an NVIDIA driver installed on your system?

smth · November 15, 2017, 3:03pm

Also, the directory /Developer/NVIDIA/CUDA-8.0/lib/stubs is supposed to contain the libraries for -lcuda and -lnvrtc. I’m not sure why they’re missing; I can check.

smth · November 15, 2017, 9:00pm

It turns out that the stubs on OS X are installed to /usr/local/cuda/lib, so I sent a PR to fix this: https://github.com/pytorch/pytorch/pull/3722
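
To confirm on a given machine where the libraries actually live, here is a minimal sketch. It assumes the libraries are named libcuda.dylib and libnvrtc.dylib and only looks in the directories mentioned in this thread; find_library is a hypothetical helper, not part of PyTorch or the CUDA toolkit.

```python
import os


def find_library(lib_name, search_dirs):
    """Return the directories in search_dirs that contain lib<lib_name>.dylib.

    Hypothetical helper for illustration; it only checks for the file name
    and does not inspect the library itself.
    """
    filename = "lib{}.dylib".format(lib_name)
    return [d for d in search_dirs if os.path.isfile(os.path.join(d, filename))]


if __name__ == "__main__":
    # Directories taken from the -L flags in the failing link command,
    # plus /usr/local/cuda/lib mentioned above. Adjust for your install.
    candidates = [
        "/Developer/NVIDIA/CUDA-8.0/lib",
        "/Developer/NVIDIA/CUDA-8.0/lib/stubs",
        "/usr/local/cuda/lib",
    ]
    for name in ("cuda", "nvrtc"):
        print(name, find_library(name, candidates))
```

If a library shows up only in a directory that is not passed to the linker with -L, that would be consistent with the "library not found" error.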

Pekka · November 16, 2017, 8:38am

That would explain why it was not found even though I had all the paths set up according to the instructions. Thanks for the quick reply, and apologies for my harsh tone earlier. I will pull your fix and try to build again ASAP.

Pekka · November 23, 2017, 12:08pm

OK, I got a bit further with the fixes. But now it fails while linking this library:

[ 93%] Linking CXX shared library libATen.dylib
Undefined symbols for architecture x86_64:
  "std::runtime_error::what() const", referenced from:
      thrust::system::system_error::what() const in ATen_generated_THCStorage.cu.o
      thrust::system::system_error::what() const in ATen_generated_THCTensorMath.cu.o
      thrust::system::system_error::what() const in ATen_generated_THCTensorMathScan.cu.o
      thrust::system::system_error::what() const in ATen_generated_THCTensorIndex.cu.o
      thrust::system::system_error::what() const in ATen_generated_THCTensorMode.cu.o
      thrust::system::system_error::what() const in ATen_generated_THCTensorSortByte.cu.o
      thrust::system::system_error::what() const in ATen_generated_THCTensorMaskedByte.cu.o
      ...
  "std::__1::__vector_base_common<true>::__throw_length_error() const", referenced from:
      at::infer_size(at::ArrayRef<long long>, at::ArrayRef<long long>) in ExpandUtils.cpp.o
      at::inferExpandGeometry(at::Tensor const&, at::ArrayRef<long long>) in ExpandUtils.cpp.o
      at::ArrayRef<long long>::vec() const in ExpandUtils.cpp.o
      at::__printTensor(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, at::Tensor&, long long) in Formatting.cpp.o
      at::native::permute(at::Tensor const&, at::ArrayRef<long long>) in NativeFunctions.cpp.o
      at::native::inferSqueezeGeometry(at::Tensor const&) in NativeFunctions.cpp.o
      at::native::inferSqueezeGeometry(at::Tensor const&, long long) in NativeFunctions.cpp.o
      ...

I even tried to build without CUDA by setting NO_CUDA=1, but got a similar error:

[ 80%] Linking CXX shared library libATen.dylib
Undefined symbols for architecture x86_64:
  "std::__1::__vector_base_common<true>::__throw_length_error() const", referenced from:
      at::infer_size(at::ArrayRef<long long>, at::ArrayRef<long long>) in ExpandUtils.cpp.o
      at::inferExpandGeometry(at::Tensor const&, at::ArrayRef<long long>) in ExpandUtils.cpp.o
      at::ArrayRef<long long>::vec() const in ExpandUtils.cpp.o
      at::__printTensor(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, at::Tensor&, long long) in Formatting.cpp.o
      at::native::permute(at::Tensor const&, at::ArrayRef<long long>) in NativeFunctions.cpp.o
      at::native::inferSqueezeGeometry(at::Tensor const&) in NativeFunctions.cpp.o
      at::native::inferSqueezeGeometry(at::Tensor const&, long long) in NativeFunctions.cpp.o
      ...
  "std::__1::__basic_string_common<true>::__throw_length_error() const", referenced from:
      std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> >::str() const in Context.cpp.o
      std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> >::str() const in ExpandUtils.cpp.o
      std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> >::str() const in NativeFunctions.cpp.o
      std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> >::str() const in CPUByteType.cpp.o
      std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> >::str() const in CPUCharType.cpp.o
      std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> >::str() const in CPUDoubleType.cpp.o
      std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> >::str() const in CPUFloatType.cpp.o
      ...

The tools and environment variables I’m using are as follows:

Xcode 8.3.2 and Xcode Command Line Tools 8.2 for macOS 10.12 (Sierra)
PATH="/usr/local/cuda/bin:/usr/local/miniconda3/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin"
CMAKE_PREFIX_PATH=/usr/local/miniconda3
DYLD_LIBRARY_PATH="/usr/local/cuda/lib"

CUDA 9.0 driver
CUDA 8.0 toolkit with patches
cuDNN 6.0 for CUDA 8.0

Build command:

MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py build RESULT > build1.log 2>&1

I also tried downgrading to the CUDA 8.0 driver and then building with CUDA, but it made no difference.

What should I try next? Could it be that the Xcode 8.x.x toolchain is just not compatible with macOS Sierra? The “undefined symbols” error sounds a bit like a compatibility problem with the toolchain. My experience with Apple’s toolchain is limited, so any hints on how to debug this are appreciated!
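
As one possible starting point, the sketch below pulls the undefined symbol names out of a saved linker log and marks which ones sit in std::__1, the inline namespace used by libc++. The function names and the regular expression are assumptions made for illustration, not part of any existing tool. Which symbols are missing is only a hint about where to look (for example, whether the C++ standard library is on the link line at all); it is not a diagnosis.

```python
import re


def undefined_symbols(linker_output):
    """Return the quoted symbol names that precede 'referenced from:'.

    Assumes the ld output format shown in this thread, where each
    undefined symbol appears in straight double quotes.
    """
    return re.findall(r'"([^"]+)", referenced from:', linker_output)


def uses_libcxx_namespace(symbol):
    """True if the symbol lives in libc++'s inline namespace std::__1."""
    return "std::__1::" in symbol


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python undefined_syms.py build1.log
    with open(sys.argv[1]) as log:
        for sym in undefined_symbols(log.read()):
            tag = "libc++ (std::__1)" if uses_libcxx_namespace(sym) else "other"
            print("{:18} {}".format(tag, sym))
```

Run against build1.log, this gives a compact list of the missing symbols instead of the full wall of linker output.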