Error when building custom CUDA kernels with PyTorch 1.6.0

I’m unable to build the FlowNet 2.0 CUDA kernels for the channelnorm, resample2d, and correlation layers with PyTorch >= 1.5.1, although they build and run fine with PyTorch <= 1.4.0. Is there a way to make this work, since I need PyTorch >= 1.5.1?
Here is a snippet of the long error log I get:

ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/software/miniconda2/envs/cu101torch151s2/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1423, in _run_ninja_build
check=True)
File "/home/software/miniconda2/envs/cu101torch151s2/lib/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "setup.py", line 32, in
'build_ext': BuildExtension
File "/home/software/miniconda2/envs/cu101torch151s2/lib/python3.6/site-packages/setuptools/__init__.py", line 163, in setup
return distutils.core.setup(**attrs)

...

File "/home/software/miniconda2/envs/cu101torch151s2/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1163, in _write_ninja_file_and_compile_objects
error_prefix='Error compiling objects for extension')
File “/home/software/miniconda2/envs/cu101torch151s2/lib/python3.6/site-packages/torch/utils/cpp_extension.py”, line 1436, in _run_ninja_build
raise RuntimeError(message)
RuntimeError: Error compiling objects for extension

For reproducing the error with PyTorch >= 1.5.1 (installed using conda):

# get flownet2-pytorch source
git clone https://github.com/NVIDIA/flownet2-pytorch.git
cd flownet2-pytorch

# install custom layers
bash install.sh

Could you disable ninja for the build of the custom extension and post the stack trace with the error message here, please?
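If your torch.utils.cpp_extension version supports the use_ninja option, one way to do this is in the extension's setup.py. A minimal sketch, assuming the repo's setup.py uses BuildExtension as its build_ext (the extension name and source file names here are illustrative):

```python
# Sketch: build a custom CUDA extension without ninja so the first real
# compiler error is easier to spot in the log. Assumes your PyTorch
# version's BuildExtension exposes the use_ninja option.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='channelnorm_cuda',
    ext_modules=[
        CUDAExtension('channelnorm_cuda',
                      ['channelnorm_cuda.cc', 'channelnorm_kernel.cu']),
    ],
    # use_ninja=False falls back to the plain distutils compiler path.
    cmdclass={'build_ext': BuildExtension.with_options(use_ninja=False)},
)
```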

Hi @ptrblck, I disabled ninja for the build. The complete stack trace is too long for my own terminal but here are some of the error messages:

/home/rakesh/software/miniconda2/envs/cu101torch16/lib/python3.6/site-packages/torch/include/torch/csrc/api/include/torch/nn/modules/container/sequential.h: In member function ‘ReturnType torch::nn::SequentialImpl::forward(InputTypes&& ...)’:
/home/rakesh/software/miniconda2/envs/cu101torch16/lib/python3.6/site-packages/torch/include/c10/util/Exception.h:333:9: error: ‘str’ is not a member of ‘c10’
         ::c10::str(__VA_ARGS__),                                      \
         ^
.....

/home/rakesh/software/miniconda2/envs/cu101torch16/lib/python3.6/site-packages/torch/include/c10/util/TypeCast.h:57:58: error: ‘apply’ is not a member of ‘c10::maybe_real<true, c10::complex<double> >’
       static_cast<int64_t>(maybe_real<real, src_t>::apply(src)));
                                                          ^
.....

/home/rakesh/software/miniconda2/envs/cu101torch16/lib/python3.6/site-packages/torch/include/c10/util/Optional.h:408:23: error: cannot bind ‘c10::intrusive_ptr<torch::jit::InlinedCallStack>’ lvalue to ‘c10::intrusive_ptr<torch::jit::InlinedCallStack>&&’
       contained_val() = std::forward<U>(v);
                       ^
/home/rakesh/software/miniconda2/envs/cu101torch16/lib/python3.6/site-packages/torch/include/c10/util/Optional.h:408:23: error: no match for ‘operator=’ (operand types are ‘const std::shared_ptr<torch::jit::Graph>’ and ‘std::shared_ptr<torch::jit::Graph>’)

......

/home/rakesh/software/miniconda2/envs/cu101torch16/lib/python3.6/site-packages/torch/include/torch/csrc/api/include/torch/nn/cloneable.h:69:56: error: no matching function for call to ‘at::Tensor::to(const c10::Device&) const’
       auto data = device && tensor.device() != *device ?
                                                        ^

Thanks for the stack trace.
You could pipe the log output to a file in case your terminal gets flooded.
That being said, if the posted error is the first one, I would assume that a stale build is causing the issue.
Could you clean the build and update the submodules before trying to rebuild?

python setup.py clean
git submodule update --init --recursive
python setup.py install 2>&1 | tee install.log

Hi @ptrblck, thanks! I noticed that the first error in install.log was

/home/rakesh/software/miniconda2/envs/cu101torch16/lib/python3.6/site-packages/torch/include/c10/util/C++17.h:24:2: error: #error You need C++14 to compile PyTorch

I changed cxx_args in setup.py to '-std=c++14', which fixed the errors.
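For reference, here is roughly what the per-layer compile-flag lists look like after the change. The gencode entry below is illustrative (use the ones from the repo's setup.py for your GPU):

```python
# Illustrative compile-flag lists for the custom-layer setup.py files.
# PyTorch >= 1.5 headers require C++14, so the standard flag must be
# '-std=c++14' for both the host compiler and nvcc.
cxx_args = ['-std=c++14']

nvcc_args = [
    '-gencode', 'arch=compute_61,code=sm_61',  # adjust for your GPU arch
    '-std=c++14',
]

# These lists are then passed to CUDAExtension via
# extra_compile_args={'cxx': cxx_args, 'nvcc': nvcc_args}.
print(cxx_args, nvcc_args)
```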
