Build error [ninja: build stopped: subcommand failed] when building PyTorch from source

longxd · March 23, 2020, 3:51pm

I have CUDA 9.1 installed and need to build the newest version of PyTorch (1.4). There is no prebuilt binary for that combination, so I downloaded the source code and tried to build it. The build fails with:

[38/4463] Performing build step for 'nccl_external'
FAILED: nccl_external-prefix/src/nccl_external-stamp/nccl_external-build nccl/lib/libnccl_static.a
cd /usr/home/work/pytorch10/third_party/nccl/nccl && env CCACHE_DISABLE=1 SCCACHE_DISABLE=1 make CXX=/usr/local/bin/c++ CUDA_HOME=/usr/local/cuda NVCC=/usr/local/cuda/bin/nvcc NVCC_GENCODE=-gencode=arch=compute_60,code=sm_60 BUILDDIR=/usr/home/work/pytorch10/build/nccl VERBOSE=0 -j && /usr/home/work/anaconda3/envs/fromsrc1/bin/cmake -E touch /usr/home/work/pytorch10/build/nccl_external-prefix/src/nccl_external-stamp/nccl_external-build
make -C src build BUILDDIR=/usr/home/work/pytorch10/build/nccl

…

Generating rules > /usr/home/work/pytorch10/build/nccl/obj/collectives/device/Makefile.rules
/bin/sh: ./gen_rules.sh: /bin/bash^M: bad interpreter: No such file or directory
Compiling functions.cu > /usr/home/work/pytorch10/build/nccl/obj/collectives/device/functions.o
nvlink warning : Function '_Z25ncclBroadcastRing_copy_i8P14CollectiveArgs' has address taken but no possible call to it
nvlink warning : Function '_Z27ncclBroadcastRingLL_copy_i8P14CollectiveArgs' has address taken but no possible call to it
nvlink warning : Function '_Z25ncclBroadcastTree_copy_i8P14CollectiveArgs' has address taken but no possible call to it
nvlink warning : Function '_Z27ncclBroadcastTreeLL_copy_i8P14CollectiveArgs' has address taken but no possible call to it
nvlink warning : Function '_Z21ncclReduceRing_sum_i8P14CollectiveArgs' has address taken but no possible call to it
nvlink warning : Function '_Z23ncclReduceRingLL_sum_i8P14CollectiveArgs' has address taken but no possible call to it
nvlink warning : Function '_Z21ncclReduceTree_sum_i8P14CollectiveArgs' has address taken but no possible call to it
nvlink warning : Function '_Z23ncclReduceTreeLL_sum_i8P14CollectiveArgs' has address taken but no possible call to it
nvlink warning : Function '_Z21ncclReduceRing_sum_u8P14CollectiveArgs' has address taken but no possible call to it
nvlink warning : Function '_Z23nccl

…

ninja: build stopped: subcommand failed.
Building wheel torch-1.5.0a0+358ba59
-- Building version 1.5.0a0+358ba59
cmake -GNinja -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr/home/work/pytorch10/torch -DCMAKE_PREFIX_PATH=/home/work/anaconda3/envs/fromsrc1 -DNUMPY_INCLUDE_DIR=/home/work/anaconda3/envs/fromsrc1/lib/python3.8/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/home/work/anaconda3/envs/fromsrc1/bin/python -DPYTHON_INCLUDE_DIR=/home/work/anaconda3/envs/fromsrc1/include/python3.8 -DPYTHON_LIBRARY=/home/work/anaconda3/envs/fromsrc1/lib/libpython3.8.so.1.0 -DTORCH_BUILD_VERSION=1.5.0a0+358ba59 -DUSE_NUMPY=True /usr/home/work/pytorch10
cmake --build . --target install --config Release -- -j 16
Traceback (most recent call last):
  File "setup.py", line 745, in <module>
    build_deps()
  File "setup.py", line 311, in build_deps
    build_caffe2(version=version,
  File "/usr/home/work/pytorch10/tools/build_pytorch_libs.py", line 62, in build_caffe2
    cmake.build(my_env)
  File "/usr/home/work/pytorch10/tools/setup_helpers/cmake.py", line 339, in build
    self.run(build_args, my_env)
  File "/usr/home/work/pytorch10/tools/setup_helpers/cmake.py", line 141, in run
    check_call(command, cwd=self.build_dir, env=env)
  File "/home/work/anaconda3/envs/fromsrc1/lib/python3.8/subprocess.py", line 364, in check_call

Can somebody help?

ptrblck · March 24, 2020, 5:16am

I’m not sure what’s going on, but apparently the third-party NCCL bundled with PyTorch fails to build.
Could you install NCCL locally, set USE_SYSTEM_NCCL=1, and point the build at its install path while compiling?
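
One clue in the log above is the line `/bin/bash^M: bad interpreter`. The `^M` is a carriage return, which means `gen_rules.sh` has Windows-style (CRLF) line endings, so the kernel looks for an interpreter literally named `/bin/bash\r`. This often happens when the source tree was checked out or copied through a tool that converts line endings. The following is a minimal sketch for finding and repairing such scripts; `NCCL_SRC`, `has_crlf`, and `strip_crlf` are hypothetical names introduced here, and the default path is an assumption about where the checkout lives.

```shell
# Sketch: find shell scripts with CRLF line endings and remove the carriage returns.
# NCCL_SRC is an assumed location (relative to the PyTorch checkout); adjust as needed.
NCCL_SRC="${NCCL_SRC:-third_party/nccl}"

# has_crlf FILE: succeeds if FILE contains at least one carriage return.
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# strip_crlf FILE: remove every carriage return from FILE.
# Writing back with cat keeps the file's permissions (for example the executable bit).
strip_crlf() {
    tmp="$(mktemp)" && tr -d '\r' < "$1" > "$tmp" && cat "$tmp" > "$1" && rm -f "$tmp"
}

if [ -d "$NCCL_SRC" ]; then
    find "$NCCL_SRC" -name '*.sh' | while IFS= read -r f; do
        if has_crlf "$f"; then
            echo "CRLF line endings: $f"
            strip_crlf "$f"
        fi
    done
fi
```

If many files are affected, a fresh clone with line-ending conversion disabled is probably cleaner than patching individual scripts.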

Epoching · September 30, 2022, 12:51am

This worked.

For anyone trying to install on IBM POWER8/POWER9 machines, I did the following:

conda install -c conda-forge cudatoolkit nccl cudnn

Then I built from source (same as the documentation, but with USE_SYSTEM_NCCL=1 added):

export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
USE_SYSTEM_NCCL=1 python setup.py install
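
Before starting a long build with USE_SYSTEM_NCCL=1, it can help to confirm that the NCCL header and library are actually present under the prefix the build will search. The sketch below assumes a conventional `include/` and `lib/` layout; `use_system_nccl` is a hypothetical helper, and `NCCL_INCLUDE_DIR` and `NCCL_LIB_DIR` are the variables PyTorch's `setup.py` documents for pointing at an NCCL install, which may be unnecessary when NCCL already sits under `CMAKE_PREFIX_PATH`.

```shell
# use_system_nccl PREFIX: export the NCCL build variables for an install under PREFIX,
# or return non-zero if the header or library is missing.
# Assumes PREFIX/include/nccl.h and PREFIX/lib/libnccl* (layout is an assumption).
use_system_nccl() {
    prefix="$1"
    if [ ! -f "$prefix/include/nccl.h" ]; then
        echo "nccl.h not found under $prefix/include" >&2
        return 1
    fi
    if ! ls "$prefix"/lib/libnccl* >/dev/null 2>&1; then
        echo "libnccl not found under $prefix/lib" >&2
        return 1
    fi
    export USE_SYSTEM_NCCL=1
    export NCCL_INCLUDE_DIR="$prefix/include"
    export NCCL_LIB_DIR="$prefix/lib"
}

# Check the active conda environment, if there is one.
if [ -n "${CONDA_PREFIX:-}" ]; then
    use_system_nccl "$CONDA_PREFIX" || echo "NCCL does not appear to be installed in $CONDA_PREFIX"
fi
```

If the check passes, the variables are exported for the current shell and `python setup.py install` can be run as shown above.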

This fixed the NCCL errors, but I then ran into other Caffe2 CUDA errors. What worked was pinning cudatoolkit to exactly 11.6:

conda install -c conda-forge cudatoolkit==11.6 nccl cudnn

I also updated the $PATH and $LD_LIBRARY_PATH variables so that the build uses gcc 8.3.1 and CUDA 11.6.
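
As a rough illustration of the path update, the sketch below puts a CUDA 11.6 toolkit first on both variables and then reports which compilers the shell resolves. The install location `/usr/local/cuda-11.6` is an assumption, not something stated in this thread; the gcc 8.3.1 directory would be prepended the same way, wherever it is installed on your machine.

```shell
# Assumed CUDA install location; override CUDA_HOME if yours differs.
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda-11.6}"

# Put the CUDA 11.6 tools and libraries ahead of any other CUDA version.
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Report which compilers the build would pick up from PATH.
for tool in gcc nvcc; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool -> $(command -v "$tool")"
    else
        echo "$tool not found on PATH"
    fi
done
```

Running `gcc --version` and `nvcc --version` afterwards is a quick way to confirm that the versions match what the build expects.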

After that, rerunning the same build commands worked fine:

export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
USE_SYSTEM_NCCL=1 python setup.py install