Pybind - Pytorch Segfault?

I have a simple toy example written in pybind11 / C++ that compiles:

#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include <pybind11/numpy.h>
#include <stdexcept>

#include <torch/extension.h>

namespace
{

namespace py = pybind11;

torch::Tensor test2(){
    auto label_map = torch::zeros({100, 100}).to(torch::kFloat32);
    return label_map;
}

} // namespace

// The first argument needs to match the name of the *.so in the BUILD file.
PYBIND11_MODULE(tools_binding, m)
{
    m.doc() = "Tools functions";

    m.def("test2",
          &test2,
          "test2"
        );

}

However, when I invoke test2(), it segfaults with

Segmentation fault

However, if I don't return anything:

void test2(){
    auto label_map = torch::zeros({100, 100}).to(torch::kFloat32);
}

then it doesn't segfault. What's going on?

I ran it under GDB and found the following:

Program received signal SIGSEGV, Segmentation fault.
0x00007fffef60d014 in THPVariable_NewWithVar(_typeobject*, at::Tensor) ()

Hi @soulslicer, did you ever manage to fix this issue? I think I have the exact same bug:

#include <torch/extension.h>

int square_sum(int size0, int size1, int size2)
{
	return size0*size0 + size1*size1 + size2*size2;
}

torch::Tensor zero_tensor(int size0, int size1, int size2)
{
	return torch::zeros({size0, size1, size2});
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("square_sum", &square_sum, "Square Sum");
    m.def("zero_tensor", &zero_tensor, "Zero Tensor");
}

The square_sum method works, the zero_tensor fails with a segmentation fault, and gdb shows the same thing as you.

I tried it with the precompiled libtorch and with a libtorch I compiled from source. I tried with and without anaconda. I tried compiling using a setup.py file and with a simple Makefile. I tried changing torch::Tensor to at::Tensor. Nothing seems to work.

Here is some output from valgrind:

==355897== Invalid read of size 8
==355897==    at 0x11C6A347: THPVariable_NewWithVar(_typeobject*, at::Tensor, c10::impl::PyInterpreterStatus) (in /home/evilgras/opt/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
==355897==    by 0x11C6AB0E: THPVariable_Wrap(at::Tensor) (in /home/evilgras/opt/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
==355897==    by 0x486FD85: void pybind11::cpp_function::initialize<at::Tensor (*&)(int, int, int), at::Tensor, int, int, int, pybind11::name, pybind11::scope, pybind11::sibling, char [12]>(at::Tensor (*&)(int, int, int), at::Tensor (*)(int, int, int), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [12])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) (pybind.h:47)
==355897==    by 0x486D2D9: pybind11::cpp_function::dispatcher(_object*, _object*, _object*) (pybind11.h:767)
==355897==    by 0x25C347: cfunction_call_varargs (call.c:743)
==355897==    by 0x25C347: PyCFunction_Call (call.c:773)
==355897==    by 0x24BDBB: _PyObject_MakeTpCall (call.c:159)
==355897==    by 0x2D7665: _PyObject_Vectorcall (abstract.h:125)
==355897==    by 0x2D7665: call_function (ceval.c:4963)
==355897==    by 0x2D7665: _PyEval_EvalFrameDefault (ceval.c:3469)
==355897==    by 0x2A126F: PyEval_EvalFrameEx (ceval.c:741)
==355897==    by 0x2A126F: _PyEval_EvalCodeWithName (ceval.c:4298)
==355897==    by 0x336542: PyEval_EvalCodeEx (ceval.c:4327)
==355897==    by 0x336542: PyEval_EvalCode (ceval.c:718)
==355897==    by 0x3365E3: run_eval_code_obj (pythonrun.c:1165)
==355897==    by 0x35C853: run_mod (pythonrun.c:1187)
==355897==    by 0x21D38F: pyrun_file (pythonrun.c:1084)
==355897==  Address 0x130 is not stack'd, malloc'd or (recently) free'd
==355897== 
==355897== 
==355897== Process terminating with default action of signal 11 (SIGSEGV)
==355897==  Access not within mapped region at address 0x130
==355897==    at 0x11C6A347: THPVariable_NewWithVar(_typeobject*, at::Tensor, c10::impl::PyInterpreterStatus) (in /home/evilgras/opt/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
==355897==    by 0x11C6AB0E: THPVariable_Wrap(at::Tensor) (in /home/evilgras/opt/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
==355897==    by 0x486FD85: void pybind11::cpp_function::initialize<at::Tensor (*&)(int, int, int), at::Tensor, int, int, int, pybind11::name, pybind11::scope, pybind11::sibling, char [12]>(at::Tensor (*&)(int, int, int), at::Tensor (*)(int, int, int), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [12])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) (pybind.h:47)
==355897==    by 0x486D2D9: pybind11::cpp_function::dispatcher(_object*, _object*, _object*) (pybind11.h:767)
==355897==    by 0x25C347: cfunction_call_varargs (call.c:743)
==355897==    by 0x25C347: PyCFunction_Call (call.c:773)
==355897==    by 0x24BDBB: _PyObject_MakeTpCall (call.c:159)
==355897==    by 0x2D7665: _PyObject_Vectorcall (abstract.h:125)
==355897==    by 0x2D7665: call_function (ceval.c:4963)
==355897==    by 0x2D7665: _PyEval_EvalFrameDefault (ceval.c:3469)
==355897==    by 0x2A126F: PyEval_EvalFrameEx (ceval.c:741)
==355897==    by 0x2A126F: _PyEval_EvalCodeWithName (ceval.c:4298)
==355897==    by 0x336542: PyEval_EvalCodeEx (ceval.c:4327)
==355897==    by 0x336542: PyEval_EvalCode (ceval.c:718)
==355897==    by 0x3365E3: run_eval_code_obj (pythonrun.c:1165)
==355897==    by 0x35C853: run_mod (pythonrun.c:1187)
==355897==    by 0x21D38F: pyrun_file (pythonrun.c:1084)

Not sure if this is a hint, but 0x130 seems like a very low number for a memory address (from the message Access not within mapped region at address 0x130 ).

Can anyone help? I'm using Ubuntu 20.04 with GCC 9.3.0, if that helps.

Update: I managed to avoid the segfault when calling zero_tensor in my module. The trick was to add an import torch to my Python script!

Previously, my Python script looked like this:

#!/usr/bin/python3
import dummy

dummy.zero_tensor(3,4,5)

(here, dummy is the name of my module). The above script causes a segfault. But when I modify it to be

#!/usr/bin/python3
import dummy
import torch

dummy.zero_tensor(3,4,5)

… it works.

Why this happens:

In torch/csrc/autograd/python_variable.cpp there is a variable called THPVariableClass, which is initialised to nullptr. The method THPVariable_Wrap (which we can see in the valgrind trace above) calls THPVariable_NewWithVar with (PyTypeObject*)THPVariableClass as the type argument, which is still the null pointer. THPVariable_NewWithVar then tries to call type->tp_alloc(type, 0) (where type is the nullptr), causing the segmentation fault.
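This also explains the odd faulting address 0x130 from the valgrind log: reading a struct member through a null pointer faults at address 0 plus the member's offset within the struct. A minimal sketch using ctypes, with a hypothetical stand-in layout (the real PyTypeObject field offsets vary by Python version and build):

```python
import ctypes

# Hypothetical stand-in for PyTypeObject: the tp_alloc slot sits at some
# nonzero offset inside the struct. Reading type->tp_alloc through a NULL
# `type` pointer dereferences address 0 + offset, which is why valgrind
# reports a fault at a small constant address like 0x130 instead of 0.
class FakeTypeObject(ctypes.Structure):
    _fields_ = [
        ("padding", ctypes.c_char * 0x130),  # assumed layout, illustration only
        ("tp_alloc", ctypes.c_void_p),       # the slot that gets called
    ]

# The address the CPU would try to read when `type` is NULL:
print(hex(FakeTypeObject.tp_alloc.offset))
```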

Normally, it seems THPVariableClass is supposed to be initialised by the function PyObject* THPAutograd_initExtension(PyObject* _unused, PyObject *unused) (inside torch/csrc/autograd/init.cpp) which, it turns out, is automatically called when you do import torch. So adding import torch fixed the problem.

Is this a bug in PyTorch? My guess is that importing any custom PyTorch/pybind11 C++ module should automatically import torch as well, so this segfault cannot happen. At the very least, either THPVariable_Wrap or THPVariable_NewWithVar should check that THPVariableClass is initialised (not null) and, if not, throw an exception with instructions on how the user can fix the issue.


Thanks for sharing the debugging, and I agree that a proper error would be preferable to a segfault. Would you mind creating an issue on GitHub with your explanation and debugging steps?

Correct. I just imported torch first and it was resolved. Thanks for doing a detailed breakdown of this issue.

Created a GitHub issue.