[SOLVED] Caffe2 C++ Segfaults with TensorCPU constructor using a vector

I’ve built Caffe2 with PyTorch on Ubuntu 16.04 with CUDA 9 and cuDNN 7.
I’m linking both caffe2 and the GPU shared objects into this code snippet:

....

int main(int argc, char** argv) {

  caffe2::GlobalInit(&argc, &argv);

  Workspace wrk;

  auto tcpu = wrk.CreateBlob("tcpu")->GetMutable<TensorCPU>();
  std::vector<float> v(10);
  std::vector<TIndex> dims({1});
  auto val = TensorCPU(dims, v, NULL);

  google::protobuf::ShutdownProtobufLibrary();

  return 0;
}

And it segfaults with

root@45ca558c940e:/slam/build# ./app
E0824 21:06:44.960511   570 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0824 21:06:44.960811   570 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0824 21:06:44.960831   570 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
*** Aborted at 1535144804 (unix time) try "date -d @1535144804" if you are using GNU date ***
PC: @           0x40e340 caffe2::TensorImpl::TensorImpl<>()
*** SIGSEGV (@0x0) received by PID 570 (TID 0x7f4f90015880) from PID 0; stack trace: ***
    @     0x7f4f8fc03390 (unknown)
    @           0x40e340 caffe2::TensorImpl::TensorImpl<>()
    @           0x40cdf2 c10::intrusive_ptr<>::make<>()
    @           0x40bcff _ZN3c1014make_intrusiveIN6caffe210TensorImplENS1_19UndefinedTensorImplEJRKSt6vectorIlSaIlEERKS4_IfSaIfEERPNS1_11BaseContextEEEENS_13intrusive_ptrIT_T0_EEDpOT1_
    @           0x40a97f caffe2::Tensor::Tensor<>()
    @           0x407645 main
    @     0x7f4f8b2f3830 __libc_start_main
    @           0x4072e9 _start
    @                0x0 (unknown)
Segmentation fault (core dumped)

I’m running this in a Docker container with the nvidia runtime. I built PyTorch into the Docker image and set seccomp=unconfined.


I’ve fixed it by creating a CPUContext manually and passing a pointer to it instead of NULL. Looks like the samples I’ve been drawing from are out of date.
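For reference, here is a sketch of that fix against my snippet above. The header paths and the `Tensor(dims, values, context)` constructor are from the Caffe2 build I’m on and may differ on yours; note also that the dims need to match the value count, which my original snippet got wrong too:

```cpp
#include <caffe2/core/init.h>
#include <caffe2/core/workspace.h>
#include <caffe2/core/tensor.h>
#include <caffe2/core/context.h>

int main(int argc, char** argv) {
  caffe2::GlobalInit(&argc, &argv);

  caffe2::Workspace wrk;
  auto tcpu = wrk.CreateBlob("tcpu")->GetMutable<caffe2::TensorCPU>();

  std::vector<float> v(10);
  // dims must multiply out to v.size(); {1} with 10 values was wrong.
  std::vector<caffe2::TIndex> dims({10});

  // Passing a live CPUContext instead of NULL avoids the null-pointer
  // dereference in TensorImpl that produced the SIGSEGV at (@0x0).
  caffe2::CPUContext ctx;
  auto val = caffe2::TensorCPU(dims, v, &ctx);

  google::protobuf::ShutdownProtobufLibrary();
  return 0;
}
```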


Just got this error too, thanks for pointing it out.

Also see https://github.com/pytorch/pytorch/issues/11317


Thanks that’s helpful.