Inconsistent errors when evaluating a JIT-compiled model with LibTorch

I recently updated my CUDA to 11.1, JIT compiled a model using the latest stable PyTorch (1.7.0), and loaded the compiled model in a C++ environment using LibTorch (also 1.7.0). The code that evaluates the model compiles correctly, but LibTorch raises errors once the model is moved to a CUDA device and evaluates tensors there. I get either PTX JIT errors or “shapes don’t match” errors inside TorchScript, depending on exactly how I evaluate the model. I’m fairly sure the errors are not a product of the PyTorch or C++ code itself, because the same code runs perfectly on CPU.
The PTX compilation error I get is:

CUDA driver error: a PTX JIT compilation failed
Exception raised from CompileToNVRTC at …/torch/csrc/jit/tensorexpr/cuda_codegen.cpp:1204 (most recent call first):

which is strange, since I’m fairly sure LibTorch detected my GPU’s architecture correctly, so I don’t see why NVRTC would error out. This also didn’t happen with previous CUDA versions.

Any idea what the problem may be?

I tried this with both CUDA 11.0 and CUDA 11.1 and hit the same problem in both cases.

Could you post a code snippet to reproduce this issue, please?

I’ll do my best to narrow down the issue and post a code snippet; I know this is too broad without one.

Thanks. In the meantime, did you try to run the LibTorch example from the tutorials, and if so, was it working? In another thread yesterday I verified that it runs fine with LibTorch 1.7 + CUDA 11.0, but since you are building from source you might be hitting a new issue.

I did try to run the example from the tutorials, and it worked. After working a bit more on this issue, I realized it is actually unrelated to CUDA 11.1/11.0.

I’m not entirely sure how the CUDA support matrix works with respect to cuDNN, LibTorch, and PyTorch, so I’ll be as explicit as possible:

What I’m currently doing is JIT compiling a model with PyTorch, loading that model with LibTorch inside C++ code, and running further computations there. Unfortunately I haven’t been able to narrow this down to a minimal working example yet, but this may be useful:
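
For context, the export side of what I’m doing looks roughly like this (Net is a hypothetical stand-in for my actual model, and model.pt is a placeholder file name):

```python
import torch

# Hypothetical stand-in for my actual model; the real one is larger.
class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(16, 16)

    def forward(self, x):
        return torch.tanh(self.fc(x))

scripted = torch.jit.script(Net())
scripted.save("model.pt")  # the C++ side loads this with torch::jit::load("model.pt")
```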

If I compile my code with CUDA Toolkit 11.1, using LibTorch
https://download.pytorch.org/libtorch/cu102/libtorch-shared-with-deps-1.5.0.zip
or
https://download.pytorch.org/libtorch/cu102/libtorch-shared-with-deps-1.5.1.zip
(note that both of these LibTorch builds are cu102, and that I compiled the model to TorchScript using PyTorch 1.5.0 and 1.5.1 respectively),
the code runs correctly on both GPU and CPU.

If I use
https://download.pytorch.org/libtorch/cu102/libtorch-shared-with-deps-1.6.0.zip
after JIT compilation with PyTorch 1.6.0, the code fails with a different error:

 The following operation failed in the TorchScript interpreter.
  Traceback of TorchScript (most recent call last):
  RuntimeError: The following operation failed in the TorchScript interpreter.
  Traceback of TorchScript (most recent call last):
  RuntimeError: strides[cur - 1] == sizes[cur] * strides[cur] INTERNAL ASSERT
  FAILED at "../torch/csrc/jit/codegen/fuser/executor.cpp":176, please report a
  bug to PyTorch.

This issue occurs both in my PyTorch code and in the code I execute from LibTorch.

And if I run with LibTorch 1.7.0 or the nightly LibTorch, I get the PTX JIT issue, but only when I execute my model on the GPU with kDouble tensors as input, after casting the model itself to kDouble (my model needs high-precision floating point numbers):

CUDA driver error: a PTX JIT compilation failed
Exception raised from CompileToNVRTC at …/torch/csrc/jit/tensorexpr/cuda_codegen.cpp:1204 (most recent call first)
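
In Python terms, the failing double-precision path is essentially this (shapes and file name are placeholders; the C++ side does the equivalent with kDouble tensors):

```python
import torch

m = torch.jit.load("model.pt").double().cuda()  # cast the model to float64 and move it to the GPU
x = torch.randn(1, 16, dtype=torch.double, device="cuda")  # placeholder shape

# The fuser compiles its CUDA kernel after a couple of warm-up runs,
# and that compilation step is where the PTX JIT error is raised.
for _ in range(5):
    y = m(x)
```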

~~I get no errors or warnings at all in my PyTorch code for 1.7.0~~
My PyTorch code also fails on 1.7.0 with the same error, but only after repeated evaluation in a loop, so I guess this is not a LibTorch issue after all.

I would generally recommend using a local CUDA toolkit that matches the one used for the pre-built binaries, to avoid CUDA version conflicts.

However, based on your last comment, are you using the CUDA 11.0 LibTorch binaries and a local CUDA 11.1 to build your application?
Could you install CUDA 11.0 locally, or build LibTorch with your local CUDA 11.1?
Also, which GPU are you using?

I tried this with a Quadro RTX 6000, and I also had someone else try it with a GeForce RTX 2080 Ti on a different machine altogether; it failed both times. I believe I was able to narrow the issue down to a minimal piece of code that breaks: it involves calling autograd on a JIT-compiled graph that uses torch.exp(x**2). It breaks with precompiled PyTorch, without using LibTorch at all, so I suspect my CUDA installation is not the issue, though there could still be something odd going on with my driver. I reported this at https://github.com/pytorch/pytorch/issues/47304.
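
A minimal sketch along those lines (the exact code in the GitHub issue may differ slightly):

```python
import torch

@torch.jit.script
def f(x):
    return torch.exp(x ** 2)

x = torch.randn(10, dtype=torch.double, device="cuda", requires_grad=True)

# The fuser only generates its CUDA kernel after a few warm-up runs,
# which is why the failure shows up only after repeated evaluation.
for _ in range(5):
    y = f(x)
    (g,) = torch.autograd.grad(y.sum(), x)
```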

Yep, thanks for the follow-up. I was already looking into the other issue and trying to narrow it down.

I found a similar issue: TorchScript traced with PyTorch 1.6 and a C++ program linked against LibTorch 1.7.0. The code runs successfully on CPU, but segfaults on CUDA:

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x00007f8c68de0e01 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(gdb) bt
#0  0x00007f8c68de0e01 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007f8c68cf9747 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007f8c68cf9b2e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007f8c68ec6442 in cuLaunchKernel () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007f8cddaa5b2e in torch::jit::tensorexpr::CudaCodeGen::CompileToNVRTC(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/gemfield/pydeepvac.cpython-36m-x86_64-linux-gnu.so
#5  0x00007f8cddaae94c in torch::jit::tensorexpr::CudaCodeGen::Initialize() () from /home/gemfield/pydeepvac.cpython-36m-x86_64-linux-gnu.so
#6  0x00007f8cddab36e8 in ?? () from /home/gemfield/pydeepvac.cpython-36m-x86_64-linux-gnu.so
#7  0x00007f8cdc7c8ef6 in torch::jit::tensorexpr::CreateCodeGen(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::tensorexpr::Stmt*, std::vector<torch::jit::tensorexpr::CodeGen::BufferArg, std::allocator<torch::jit::tensorexpr::CodeGen::BufferArg> > const&, c10::Device) () from /home/gemfield/pydeepvac.cpython-36m-x86_64-linux-gnu.so
#8  0x00007f8cdc84e4ba in torch::jit::tensorexpr::TensorExprKernel::compile() () from /home/gemfield/pydeepvac.cpython-36m-x86_64-linux-gnu.so
#9  0x00007f8cdc84ea22 in torch::jit::tensorexpr::TensorExprKernel::TensorExprKernel(std::shared_ptr<torch::jit::Graph> const&) () from /home/gemfield/pydeepvac.cpython-36m-x86_64-linux-gnu.so
#10 0x00007f8cdc6c8cfd in ?? () from /home/gemfield/pydeepvac.cpython-36m-x86_64-linux-gnu.so
#11 0x00007f8cdc71145e in ?? () from /home/gemfield/pydeepvac.cpython-36m-x86_64-linux-gnu.so
#12 0x00007f8cdc710f63 in ?? () from /home/gemfield/pydeepvac.cpython-36m-x86_64-linux-gnu.so
#13 0x00007f8cdc713512 in ?? () from /home/gemfield/pydeepvac.cpython-36m-x86_64-linux-gnu.so
#14 0x00007f8cdc71385f in ?? () from /home/gemfield/pydeepvac.cpython-36m-x86_64-linux-gnu.so
#15 0x00007f8cdc713a1f in ?? () from /home/gemfield/pydeepvac.cpython-36m-x86_64-linux-gnu.so
#16 0x00007f8cdc710eb5 in ?? () from /home/gemfield/pydeepvac.cpython-36m-x86_64-linux-gnu.so
#17 0x00007f8cdc713512 in ?? () from /home/gemfield/pydeepvac.cpython-36m-x86_64-linux-gnu.so
#18 0x00007f8cdc709e68 in torch::jit::Code::Code(std::shared_ptr<torch::jit::Graph> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long) ()
   from /home/gemfield/pydeepvac.cpython-36m-x86_64-linux-gnu.so
#19 0x00007f8cdc72967c in ?? () from /home/gemfield/pydeepvac.cpython-36m-x86_64-linux-gnu.so
#20 0x00007f8cdc728706 in ?? () from /home/gemfield/pydeepvac.cpython-36m-x86_64-linux-gnu.so
#21 0x00007f8cdc6f9fb5 in ?? () from /home/gemfield/pydeepvac.cpython-36m-x86_64-linux-gnu.so
#22 0x00007f8cdc416b7a in torch::jit::GraphFunction::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) () from /home/gemfield/pydeepvac.cpython-36m-x86_64-linux-gnu.so
#23 0x00007f8cdc4260e5 in torch::jit::Method::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) () from /home/gemfield/pydeepvac.cpython-36m-x86_64-linux-gnu.so
#24 0x00007f8cda1803e0 in torch::jit::Module::forward(std::vector<c10::IValue, std::allocator<c10::IValue> >) () from /home/gemfield/pydeepvac.cpython-36m-x86_64-linux-gnu.so

When the C++ program is linked against LibTorch 1.6, both the CPU and GPU runs succeed.

Are you seeing the same issue when using matching 1.7.1 versions, or only in the case of mismatched releases?