Error in loss.backward() When Using Python API in C++ Frontend

tttsss-01 · June 24, 2023, 9:28am

I’m trying to train my model with the C++ frontend API. To handle the training dataset, I call my Python script for data preprocessing and loading through the Python API. However, when execution reaches loss.backward();, the following error occurs:

terminate called after throwing an instance of 'c10::Error'
  what():  The autograd engine was called while holding the GIL. If you are using the C++ API, the autograd engine is an expensive operation that does not require the GIL to be held so you should release it with 'pybind11::gil_scoped_release no_gil;'. If you are not using the C++ API, please report a bug to the pytorch team.
Exception raised from execute at ../torch/csrc/autograd/python_engine.cpp:131 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ffff7f4b4d7 in /home/stone/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7ffff7f15434 in /home/stone/libtorch/lib/libc10.so)
frame #2: torch::autograd::python::PythonEngine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) + 0x8f (0x7fff4dbdb89f in /home/stone/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x4abdbc1 (0x7fffe2e7abc1 in /home/stone/libtorch/lib/libtorch_cpu.so)
frame #4: torch::autograd::backward(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, c10::optional<bool>, bool, std::vector<at::Tensor, std::allocator<at::Tensor> > const&) + 0x5c (0x7fffe2e7ca1c in /home/stone/libtorch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x4b1429e (0x7fffe2ed129e in /home/stone/libtorch/lib/libtorch_cpu.so)
frame #6: at::Tensor::_backward(c10::ArrayRef<at::Tensor>, c10::optional<at::Tensor> const&, c10::optional<bool>, bool) const + 0x48 (0x7fffdff59a98 in /home/stone/libtorch/lib/libtorch_cpu.so)
frame #7: at::Tensor::backward(at::Tensor const&, c10::optional<bool>, bool, c10::optional<c10::ArrayRef<at::Tensor> >) const + 0x144 (0x5555555e42ac in /home/stone/projects/gru/build/train)
frame #8: main + 0x8e9 (0x5555555e1c47 in /home/stone/projects/gru/build/train)
frame #9: <unknown function> + 0x29d90 (0x7fff895d7d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #10: __libc_start_main + 0x80 (0x7fff895d7e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #11: _start + 0x25 (0x5555555e1265 in /home/stone/projects/gru/build/train)

I don’t get this error if I don’t use the Python API.
According to the error message, I tried using pybind11::gil_scoped_release no_gil;, but I received the following error: namespace 'pybind11' has no member 'gil_scoped_release'.

ptrblck · June 24, 2023, 7:01pm

Maybe some includes are missing? Did you try adding:

#include <pybind11/pybind11.h>
#include <pybind11/functional.h>
#include <torch/csrc/utils/pybind.h>
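
For readers unsure what the error message is asking for, here is a rough sketch of the scoping behaviour of a release guard. It does not use pybind11 or libtorch: `fake_gil_held`, `FakeGilScopedRelease`, `fake_backward`, and the two helper functions are hypothetical stand-ins made up for illustration. It assumes the thread that embeds Python holds the GIL by default, and that the autograd check simply refuses to run while the lock is held, as the message above describes.

```cpp
#include <cassert>
#include <stdexcept>

// Stand-in for "does this thread hold the GIL?".
// Assumption: the embedding thread starts out holding it.
thread_local bool fake_gil_held = true;

// Hypothetical analogue of pybind11::gil_scoped_release:
// drops the lock on construction, restores the previous state on destruction.
class FakeGilScopedRelease {
 public:
  FakeGilScopedRelease() : was_held_(fake_gil_held) { fake_gil_held = false; }
  ~FakeGilScopedRelease() {
    // Nothing inside the scope should have re-acquired the lock.
    assert(!fake_gil_held);
    fake_gil_held = was_held_;
  }
  FakeGilScopedRelease(const FakeGilScopedRelease&) = delete;
  FakeGilScopedRelease& operator=(const FakeGilScopedRelease&) = delete;

 private:
  bool was_held_;
};

// Stand-in for loss.backward(): throws if the lock is still held.
void fake_backward() {
  if (fake_gil_held) {
    throw std::runtime_error(
        "The autograd engine was called while holding the GIL.");
  }
}

// Calls backward with the lock held; returns false because the check fires.
bool backward_without_release() {
  try {
    fake_backward();
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}

// Calls backward inside a release scope; the lock is restored on return.
bool backward_with_release() {
  FakeGilScopedRelease no_gil;
  try {
    fake_backward();
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```

In the real program the same shape would be a braced scope that declares `pybind11::gil_scoped_release no_gil;` and then calls `loss.backward();`, so that the GIL is held again once the scope ends and later calls into the Python script still work. Whether that alone resolves the crash in this thread is not something the sketch can show.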

tttsss-01 · June 25, 2023, 2:24am

After adding the includes, I got a new error.

tttsss-01 · June 25, 2023, 12:06pm

I got another error.

ptrblck · June 25, 2023, 7:58pm

You would need to debug the segfault via gdb to narrow down where this memory access violation comes from.

TiStar · January 7, 2025, 8:08am

Any updates on this? I’m facing the same error.
Thanks in advance.