Compiling PyTorch with tarball-installed NCCL

I installed NCCL 2.4.8 using the “O/S agnostic local installer” option from the NVIDIA website. This gave me a file nccl_2.4.8-1+cuda10.1_x86_64.txz, which I extracted into a new directory /opt/nccl-2.4.8. I’m now trying to compile PyTorch 1.4.1 (exactly at git tag v1.4.1) against this NCCL installation, so that it is consistent within my Anaconda environment with the other compiled applications there that use NCCL. As this is a server installation, I am also trying to make it possible to have multiple versions of CUDA, cuDNN, NCCL, TensorRT, etc. in parallel, so all of them need to be completely local installs (i.e. no debs).

So far I can’t get CMake to be happy that it has found NCCL properly. It finds NCCL fine (both headers and library), but then fails to identify its version, and the subsequent ‘header matches library’ check fails too (I have to manually force the version identification to ‘succeed’ for CMake even to get that far). Does anyone have any experience with this?

Ubuntu 18.04, PyTorch 1.4.1, CUDA 10.1.243, cuDNN 7.6.5, CMake 3.10.2, NCCL 2.4.8

# Fresh compile...
rm -rf ~/Programs/PyTorch/pytorch/build/*
cd ~/Programs/PyTorch/pytorch
conda activate dl
source /usr/local/cuda-10.1/add_path.sh
source /opt/openmpi-2.1.1/add_path.sh
source /opt/nccl-2.4.8/add_path.sh
source /opt/TensorRT-6.0.1.5/add_path.sh
export CMAKE_PREFIX_PATH="${CONDA_PREFIX:-"$(dirname $(which conda))/../"}"
export CUDA_LIB_PATH=/usr/local/cuda-10.1/extras/system
export NCCL_ROOT_DIR=/opt/nccl-2.4.8
export USE_SYSTEM_NCCL=ON
export TENSORRT_ROOT=/opt/TensorRT-6.0.1.5
export BUILD_BINARY=ON
export BUILD_DOCS=ON
export USE_NCCL=ON
export USE_TENSORRT=ON
export USE_FFMPEG=ON
export USE_OPENMP=ON
export USE_OPENCV=ON
export USE_MKLDNN=ON
export USE_NNPACK=ON
export USE_GFLAGS=ON
export USE_GLOG=ON
export GPU_ARCH=75
$ env | grep PATH
LD_LIBRARY_PATH=/opt/TensorRT-6.0.1.5/lib:/opt/nccl-2.4.8/lib:/opt/openmpi-2.1.1/lib:/usr/local/cuda-10.1/extras/CUPTI/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/extras/system/lib64
CUDA_LIB_PATH=/usr/local/cuda-10.1/extras/system
CUDA_PATH=/usr/local/cuda-10.1
CMAKE_PREFIX_PATH=/home/escarda/anaconda3/envs/dl
PATH=/opt/TensorRT-6.0.1.5/bin:/opt/openmpi-2.1.1/bin:/usr/local/cuda-10.1/bin:/home/escarda/anaconda3/envs/dl/bin:/home/escarda/anaconda3/condabin:/usr/local/texlive/2019/bin/x86_64-linux:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin                       
...
$ python setup.py build --cmake-only
Building wheel torch-1.4.0a0+7404463                                                                                                                       
-- Building version 1.4.0a0+7404463                                                                                                                        
cmake -GNinja -DBUILD_BINARY=ON -DBUILD_DOCS=ON -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/escarda/Programs/PyTorch/pytorch/torch -DCMAKE_PREFIX_PATH=/home/escarda/anaconda3/envs/dl -DNUMPY_INCLUDE_DIR=/home/escarda/anaconda3/envs/dl/lib/python3.6/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/home/escarda/anaconda3/envs/dl/bin/python -DPYTHON_INCLUDE_DIR=/home/escarda/anaconda3/envs/dl/include/python3.6m -DPYTHON_LIBRARY=/home/escarda/anaconda3/envs/dl/lib/libpython3.6m.so.1.0 -DTORCH_BUILD_VERSION=1.4.0a0+7404463 -DUSE_FFMPEG=ON -DUSE_GFLAGS=ON -DUSE_GLOG=ON -DUSE_MKLDNN=ON -DUSE_NCCL=ON -DUSE_NNPACK=ON -DUSE_NUMPY=True -DUSE_OPENCV=ON -DUSE_OPENMP=ON -DUSE_SYSTEM_NCCL=ON -DUSE_TENSORRT=ON /home/escarda/Programs/PyTorch/pytorch                                                                                                                                     
-- The CXX compiler identification is GNU 7.5.0                                                                                                            
-- The C compiler identification is GNU 7.5.0
...
-- Found CUDA: /usr/local/cuda-10.1 (found version "10.1") 
-- Caffe2: CUDA detected: 10.1
-- Caffe2: CUDA nvcc is: /usr/local/cuda-10.1/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda-10.1
-- Caffe2: Header version is: 10.1
-- Found CUDNN: /usr/local/cuda-10.1/lib64/libcudnn.so  
-- Found TENSORRT: /opt/TensorRT-6.0.1.5/include  
-- Found cuDNN: v7.6.5  (include: /usr/local/cuda-10.1/include, library: /usr/local/cuda-10.1/lib64/libcudnn.so)
-- Autodetected CUDA architecture(s):  7.5 7.5 7.5 7.5
-- Added CUDA NVCC flags for: -gencode;arch=compute_75,code=sm_75
-- Found NCCL: /opt/nccl-2.4.8/include  
-- Determining NCCL version from /opt/nccl-2.4.8/include/nccl.h...
-- Looking for NCCL_VERSION_CODE
-- Looking for NCCL_VERSION_CODE - not found
-- NCCL version < 2.3.5-5
-- Found NCCL (include: /opt/nccl-2.4.8/include, library: /opt/nccl-2.4.8/lib/libnccl.so)
-- Could NOT find CUB (missing: CUB_INCLUDE_DIR) 
-- MPI include path: /opt/openmpi-2.1.1/include
-- MPI libraries: /opt/openmpi-2.1.1/lib/libmpi_cxx.so/opt/openmpi-2.1.1/lib/libmpi.so
-- Found CUDA: /usr/local/cuda-10.1 (found suitable version "10.1", minimum required is "7.0") 
-- CUDA detected: 10.1
-- Could NOT find NCCL (missing: NCCL_INCLUDE_DIR) 
CMake Warning at third_party/gloo/cmake/Dependencies.cmake:96 (message):
  Not compiling with NCCL support.  Suppress this warning with
  -DUSE_NCCL=OFF.
Call Stack (most recent call first):
  third_party/gloo/CMakeLists.txt:56 (include)
...
-- ******** Summary ********
-- General:
--   CMake version         : 3.10.2
--   CMake command         : /usr/bin/cmake
--   System                : Linux
--   C++ compiler          : /usr/bin/c++
--   C++ compiler id       : GNU
--   C++ compiler version  : 7.5.0
--   BLAS                  : MKL
--   CXX flags             :  -fvisibility-inlines-hidden -fopenmp -DTENSORRT_VERSION_MAJOR=6 -DTENSORRT_VERSION_MINOR=0 -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow
--   Build type            : Release
--   Compile definitions   : TH_BLAS_MKL;ONNX_ML=1;ONNX_NAMESPACE=onnx_torch;MAGMA_V2;IDEEP_USE_MKL;HAVE_MMAP=1;_FILE_OFFSET_BITS=64;HAVE_SHM_OPEN=1;HAVE_SHM_UNLINK=1;HAVE_MALLOC_USABLE_SIZE=1
--   CMAKE_PREFIX_PATH     : /home/escarda/anaconda3/envs/dl;/usr/local/cuda-10.1;/opt/nccl-2.4.8;/usr/local/cuda-10.1
--   CMAKE_INSTALL_PREFIX  : /home/escarda/Programs/PyTorch/pytorch/torch
-- 
--   TORCH_VERSION         : 1.4.0
--   CAFFE2_VERSION        : 1.4.0
--   BUILD_CAFFE2_MOBILE   : ON
--   USE_STATIC_DISPATCH   : OFF
--   BUILD_BINARY          : ON
--   BUILD_CUSTOM_PROTOBUF : ON
--     Link local protobuf : ON
--   BUILD_DOCS            : ON
--   BUILD_PYTHON          : True
--     Python version      : 3.6.10
--     Python executable   : /home/escarda/anaconda3/envs/dl/bin/python
--     Pythonlibs version  : 3.6.10
--     Python library      : /home/escarda/anaconda3/envs/dl/lib/libpython3.6m.so.1.0
--     Python includes     : /home/escarda/anaconda3/envs/dl/include/python3.6m
--     Python site-packages: lib/python3.6/site-packages
--   BUILD_CAFFE2_OPS      : ON
--   BUILD_SHARED_LIBS     : ON
--   BUILD_TEST            : True
--   BUILD_JNI             : OFF
--   INTERN_BUILD_MOBILE   : 
--   USE_ASAN              : OFF
--   USE_CUDA              : ON
--     CUDA static link    : OFF
--     USE_CUDNN           : ON
--     CUDA version        : 10.1
--     cuDNN version       : 7.6.5
--     CUDA root directory : /usr/local/cuda-10.1
--     CUDA library        : /usr/local/cuda-10.1/lib64/stubs/libcuda.so
--     cudart library      : /usr/local/cuda-10.1/lib64/libcudart.so
--     cublas library      : /usr/local/cuda-10.1/extras/system/lib64/libcublas.so
--     cufft library       : /usr/local/cuda-10.1/lib64/libcufft.so
--     curand library      : /usr/local/cuda-10.1/lib64/libcurand.so
--     cuDNN library       : /usr/local/cuda-10.1/lib64/libcudnn.so
--     nvrtc               : /usr/local/cuda-10.1/lib64/libnvrtc.so
--     CUDA include path   : /usr/local/cuda-10.1/include
--     NVCC executable     : /usr/local/cuda-10.1/bin/nvcc
--     CUDA host compiler  : /usr/bin/cc
--     USE_TENSORRT        : ON
--       TensorRT runtime library: /opt/TensorRT-6.0.1.5/lib/libnvinfer.so
--       TensorRT include path   : /opt/TensorRT-6.0.1.5/include
--   USE_ROCM              : OFF
--   USE_EIGEN_FOR_BLAS    : 
--   USE_FBGEMM            : ON
--   USE_FFMPEG            : ON
--   USE_GFLAGS            : ON
--   USE_GLOG              : ON
--   USE_LEVELDB           : OFF
--   USE_LITE_PROTO        : OFF
--   USE_LMDB              : OFF
--   USE_METAL             : OFF
--   USE_MKL               : ON
--   USE_MKLDNN            : ON
--   USE_MKLDNN_CBLAS      : OFF
--   USE_NCCL              : ON
--     USE_SYSTEM_NCCL     : ON
--   USE_NNPACK            : ON
--   USE_NUMPY             : ON
--   USE_OBSERVERS         : ON
--   USE_OPENCL            : OFF
--   USE_OPENCV            : ON
--     OpenCV version      : 4.3.0
--   USE_OPENMP            : ON
--   USE_TBB               : OFF
--   USE_PROF              : OFF
--   USE_QNNPACK           : ON
--   USE_REDIS             : OFF
--   USE_ROCKSDB           : OFF
--   USE_ZMQ               : OFF
--   USE_DISTRIBUTED       : ON
--     USE_MPI             : ON
--     USE_GLOO            : ON
--   BUILD_NAMEDTENSOR   : OFF
--   Public Dependencies  : Threads::Threads;caffe2::mkl;glog::glog;caffe2::mkldnn
--   Private Dependencies : qnnpack;pytorch_qnnpack;nnpack;cpuinfo;fbgemm;/usr/lib/x86_64-linux-gnu/libnuma.so;opencv_core;opencv_highgui;opencv_imgproc;opencv_imgcodecs;opencv_optflow;opencv_videoio;opencv_video;/usr/lib/x86_64-linux-gnu/libavcodec.so;/usr/lib/x86_64-linux-gnu/libavformat.so;/usr/lib/x86_64-linux-gnu/libavutil.so;/usr/lib/x86_64-linux-gnu/libswscale.so;fp16;/opt/openmpi-2.1.1/lib/libmpi_cxx.so;/opt/openmpi-2.1.1/lib/libmpi.so;gloo;aten_op_header_gen;foxi_loader;rt;gcc_s;gcc;dl
-- Configuring done
CMake Warning (dev) at cmake/Dependencies.cmake:1067 (add_dependencies):
  Policy CMP0046 is not set: Error on non-existent dependency in
  add_dependencies.  Run "cmake --help-policy CMP0046" for policy details.
  Use the cmake_policy command to set the policy and suppress this warning.

  The dependency target "nccl_external" of target "gloo_cuda" does not exist.
Call Stack (most recent call first):
  CMakeLists.txt:380 (include)
This warning is for project developers.  Use -Wno-dev to suppress it.
...

There are some slightly non-standard things in there because CUDA, cuDNN, TensorRT, etc. are local installs, but they are all found and used fine. The correct NCCL header is found (/opt/nccl-2.4.8/include/nccl.h), but the NCCL_VERSION_CODE extraction fails, despite the file containing the following line:

#define NCCL_VERSION_CODE 2408

The correct library is also found (/opt/nccl-2.4.8/lib/libnccl.so), so why is cmake getting confused about the version? Any ideas?
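For what it’s worth, the code 2408 really does encode the expected version. A small illustrative sketch (NCCL releases of this era pack the version as major*1000 + minor*100 + patch; the helper name here is my own, not from NCCL):

```python
# Decode an NCCL_VERSION_CODE of the 2.x era, which is packed as
# major*1000 + minor*100 + patch (so 2408 corresponds to 2.4.8).
def nccl_version(code):
    return (code // 1000, code % 1000 // 100, code % 100)

print(nccl_version(2408))  # (2, 4, 8)
```

So the header itself is consistent with the installed 2.4.8 tarball; the failure is in CMake’s extraction step, not in the header.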

The NCCL_INCLUDE_DIR seems to be missing.
Could you try these commands, changing the paths accordingly?

NCCL_INCLUDE_DIR="/usr/include/" \
NCCL_LIB_DIR="/usr/lib/" \
USE_SYSTEM_NCCL=1 \
python setup.py install

When I try your suggestion on a fresh clone of the repository with the minimal possible extra configuration, I get:

git clone --recursive https://github.com/pytorch/pytorch pytorch_test
cd pytorch_test/
git checkout --recurse-submodules v1.4.1
conda activate dl
source /usr/local/cuda-10.1/add_path.sh  # Defines CUDA_PATH and modifies PATH/LD_LIBRARY_PATH for CUDA 10.1
source /opt/nccl-2.4.8/add_path.sh  # Modifies LD_LIBRARY_PATH for NCCL 2.4.8

CMAKE_PREFIX_PATH="${CONDA_PREFIX:-"$(dirname $(which conda))/../"}" \
CUDA_LIB_PATH=/usr/local/cuda-10.1/extras/system \
NCCL_INCLUDE_DIR=/opt/nccl-2.4.8/include \
NCCL_LIB_DIR=/opt/nccl-2.4.8/lib \
USE_SYSTEM_NCCL=ON \
USE_NCCL=ON \
python setup.py build --cmake-only
-- Building version 1.4.0a0+7404463
cmake -GNinja -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/escarda/Programs/PyTorch/pytorch_test/torch -DCMAKE_PREFIX_PATH=/home/escarda/anaconda3/envs/dl -DNUMPY_INCLUDE_DIR=/home/escarda/anaconda3/envs/dl/lib/python3.6/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/home/escarda/anaconda3/envs/dl/bin/python -DPYTHON_INCLUDE_DIR=/home/escarda/anaconda3/envs/dl/include/python3.6m -DPYTHON_LIBRARY=/home/escarda/anaconda3/envs/dl/lib/libpython3.6m.so.1.0 -DTORCH_BUILD_VERSION=1.4.0a0+7404463 -DUSE_NCCL=ON -DUSE_NUMPY=True -DUSE_SYSTEM_NCCL=ON /home/escarda/Programs/PyTorch/pytorch_test
-- The CXX compiler identification is GNU 7.5.0
-- The C compiler identification is GNU 7.5.0
...
-- Found CUDA: /usr/local/cuda-10.1 (found version "10.1") 
-- Caffe2: CUDA detected: 10.1
-- Caffe2: CUDA nvcc is: /usr/local/cuda-10.1/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda-10.1
-- Caffe2: Header version is: 10.1
-- Found CUDNN: /usr/local/cuda-10.1/lib64/libcudnn.so  
-- Found cuDNN: v7.6.5  (include: /usr/local/cuda-10.1/include, library: /usr/local/cuda-10.1/lib64/libcudnn.so)
-- Autodetected CUDA architecture(s):  7.5 7.5 7.5 7.5
-- Added CUDA NVCC flags for: -gencode;arch=compute_75,code=sm_75
-- Found NCCL: /opt/nccl-2.4.8/include  
-- Determining NCCL version from /opt/nccl-2.4.8/include/nccl.h...
-- Looking for NCCL_VERSION_CODE
-- Looking for NCCL_VERSION_CODE - not found
-- NCCL version < 2.3.5-5
-- Found NCCL (include: /opt/nccl-2.4.8/include, library: /opt/nccl-2.4.8/lib/libnccl.so)
-- Could NOT find CUB (missing: CUB_INCLUDE_DIR) 
-- MPI include path: /usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/opal/mca/event/libevent2022/libevent/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/opal/mca/event/libevent2022/libevent/include/usr/lib/x86_64-linux-gnu/openmpi/include
-- MPI libraries: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so/usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so
-- Found CUDA: /usr/local/cuda-10.1 (found suitable version "10.1", minimum required is "7.0") 
-- CUDA detected: 10.1
-- Determining NCCL version from the header file: /opt/nccl-2.4.8/include/nccl.h
-- NCCL_MAJOR_VERSION: 2
-- Found NCCL (include: /opt/nccl-2.4.8/include, library: /opt/nccl-2.4.8/lib/libnccl.so)

As before, the first attempt to find NCCL fails the version check; only the second detection (from gloo) succeeds.

Is there any way to add custom arguments to the cmake command line without manually hacking it into the setup code?

This would allow things like --trace --trace-expand to be used to debug what’s going on, and would allow some variables to be set that are currently ignored. For example, the variable TENSORRT_ROOT is intended to specify the TensorRT location, but it has no effect: $ENV{TENSORRT_ROOT} is never queried in the CMakeLists files, and TENSORRT_ROOT is stripped from the cmake command line simply because it doesn’t start with BUILD_, USE_ or CMAKE_ (see tools/setup_helpers/cmake.py:257).
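The filtering in question can be sketched roughly like this (simplified and with a hypothetical function name; the real logic, including some explicitly whitelisted variables, lives in tools/setup_helpers/cmake.py):

```python
# Simplified sketch of the env-var pass-through filter in PyTorch's build
# scripts: only variables with certain prefixes are forwarded to cmake as
# -D arguments; everything else is silently dropped.
def passed_to_cmake(var):
    return var.startswith(('BUILD_', 'USE_', 'CMAKE_'))

print(passed_to_cmake('USE_SYSTEM_NCCL'))  # True  -> forwarded
print(passed_to_cmake('TENSORRT_ROOT'))    # False -> silently dropped
```

Under this scheme a variable like TENSORRT_ROOT never reaches cmake at all, which matches the behaviour described above.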

Any ideas on my previous questions?

I manually added --trace --trace-expand to the cmake command line and noticed that the version check fails because, when CMake compiles its little test program to extract the version, it cannot find the NCCL and CUDA headers. This seems to be a systematic issue with the current PyTorch build that will come up whenever a locally pre-installed NCCL and CUDA are used.

In order to get PyTorch to compile in my situation, I had to make the following changes:

Edit cmake/Modules_CUDA_fix/upstream/FindCUDA.cmake
	Right before: #  End of unset()
	Add line: unset(CUDA_cublas_device_LIBRARY CACHE)

Edit cmake/Modules/FindNCCL.cmake
	Change:
		-  list (APPEND CMAKE_REQUIRED_INCLUDES ${NCCL_INCLUDE_DIRS})
		+  list (APPEND CMAKE_REQUIRED_INCLUDES ${NCCL_INCLUDE_DIRS} ${CUDA_INCLUDE_DIRS})
	Change:
		+          CMAKE_FLAGS "-DINCLUDE_DIRECTORIES=${NCCL_INCLUDE_DIRS};${CUDA_INCLUDE_DIRS}"
		           RUN_OUTPUT_VARIABLE NCCL_VERSION_FROM_HEADER
		-          LINK_LIBRARIES ${NCCL_LIBRARIES})
		+          LINK_LIBRARIES ${NCCL_LIBRARIES} ${CUDA_LIBRARIES})

Edit tools/setup_helpers/cmake.py
	Change: elif var.startswith(('BUILD_', 'USE_', 'CMAKE_')):
	To: elif var.startswith(('BUILD_', 'USE_', 'CMAKE_', 'TENSORRT_')):

CMake still claims it cannot find the version of my local NCCL, but torch.cuda.nccl.version() prints the right version after the source build.

Unfortunately, I don’t know of a way to add custom arguments to cmake beyond the env vars and -D args.