Segfaults (CPU) and double free errors (GPU) when using ATen with CUDA

I get errors both when I run on CPU (with the .to(at::kCUDA) calls commented out) and on GPU (with them uncommented). In both cases, the errors happen after the convolution completes. Here is a minimal example:

#include <ATen/Functions.h>
#include <iostream>
#include <cmath>

void debug_memory_issue() {
  int FHW = 3;
  int padding = 1;
  auto N = 1;
  auto C = 1;
  auto HW = 16;
  at::Tensor imgs = at::rand({N, C, HW, HW}); //.to(at::kCUDA);
  at::Tensor fils = at::rand({C, C, FHW, FHW}); //.to(at::kCUDA);
  std::cout << "running conv" << std::endl;
  at::conv2d(imgs, fils, /* bias */ {}, /* stride */ 1, padding);
  std::cout << "done running conv" << std::endl;
}

int main() {
  debug_memory_issue();
}

On CPU I get a segfault, and the backtrace from lldb is:

(lldb) bt
* thread #1, name = 'example-app', stop reason = signal SIGSEGV: invalid address (fault address: 0x8aee6b0)
  * frame #0: 0x00007ffff4d88100 libcudart.so.9.2`___lldb_unnamed_symbol512$$libcudart.so.9.2 + 1296
    frame #1: 0x00007ffff4d88256 libcudart.so.9.2`___lldb_unnamed_symbol513$$libcudart.so.9.2 + 54
    frame #2: 0x00007fffb61c605a libc.so.6`__cxa_finalize + 154
    frame #3: 0x00007fffb7990673 libtorch_cuda.so`__do_global_dtors_aux + 35
    frame #4: 0x00007ffff7deb07a ld-2.17.so`_dl_fini + 506
    frame #5: 0x00007fffb61c5ce9 libc.so.6`__run_exit_handlers + 217
    frame #6: 0x00007fffb61c5d37 libc.so.6`exit + 23
    frame #7: 0x00007fffb61ae55c libc.so.6`__libc_start_main + 252
    frame #8: 0x0000000000417f39 example-app`_start + 41
(lldb)

On GPU I get this error:

*** Error in `./example-app': double free or corruption (!prev): 0x00000000017f5610 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81299)[0x7f96d02ec299]
/usr/local/cuda/lib64/libcudart.so.9.2(+0x1dde7)[0x7f970ee66de7]
/usr/local/cuda/lib64/libcudart.so.9.2(+0x1e256)[0x7f970ee67256]
/lib64/libc.so.6(__cxa_finalize+0x9a)[0x7f96d02a505a]
/home/path/to/libtorch/lib/libtorch_cuda.so(+0xf19673)[0x7f96d1a6f673]
======= Memory map: ========
00400000-00438000 r-xp 00000000 00:28 266742313                          /data/users/me/path/to/executable
[...]

and lldb gives me this backtrace:

(lldb) bt
* thread #1, name = 'example-app', stop reason = signal SIGABRT
  * frame #0: 0x00007fffb61c2387 libc.so.6`raise + 55
    frame #1: 0x00007fffb61c3a78 libc.so.6`abort + 328
    frame #2: 0x00007fffb6204ed7 libc.so.6`__libc_message + 983
    frame #3: 0x00007fffb620d299 libc.so.6`_int_free + 1305
    frame #4: 0x00007ffff4d87de7 libcudart.so.9.2`___lldb_unnamed_symbol512$$libcudart.so.9.2 + 503
    frame #5: 0x00007ffff4d88256 libcudart.so.9.2`___lldb_unnamed_symbol513$$libcudart.so.9.2 + 54
    frame #6: 0x00007fffb61c605a libc.so.6`__cxa_finalize + 154
    frame #7: 0x00007fffb7990673 libtorch_cuda.so`__do_global_dtors_aux + 35
    frame #8: 0x00007ffff7deb07a ld-2.17.so`_dl_fini + 506
    frame #9: 0x00007fffb61c5ce9 libc.so.6`__run_exit_handlers + 217
    frame #10: 0x00007fffb61c5d37 libc.so.6`exit + 23
    frame #11: 0x00007fffb61ae55c libc.so.6`__libc_start_main + 252
    frame #12: 0x0000000000419bd9 example-app`_start + 41
(lldb)

I’ve tried using valgrind, but it reports so many memory errors that I can’t pinpoint what I’m doing wrong.

The closest thing I’ve found online is this issue, but I believe I’m using a supported version of CUDA (9.2.88) and the right version of cuDNN (7.6.5).

What am I doing wrong here?

The code runs fine for me:

# main.cpp
#include <ATen/Functions.h>
#include <iostream>
#include <cmath>

void debug_memory_issue() {
  int FHW = 3;
  int padding = 1;
  auto N = 1;
  auto C = 1;
  auto HW = 16;
  at::Tensor imgs = at::rand({N, C, HW, HW}).to(at::kCUDA);
  at::Tensor fils = at::rand({C, C, FHW, FHW}).to(at::kCUDA);
  std::cout << "running conv" << std::endl;
  at::conv2d(imgs, fils, /* bias */ {}, /* stride */ 1, padding);
  std::cout << "done running conv" << std::endl;
}

int main() {
  debug_memory_issue();
}

# CMakeLists.txt
cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
project(example-app)

find_package(Torch REQUIRED)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")

add_executable(example-app main.cpp)
target_link_libraries(example-app "${TORCH_LIBRARIES}")
set_property(TARGET example-app PROPERTY CXX_STANDARD 14)

# in ./builds
$ cmake -DCMAKE_PREFIX_PATH=PATH_TO_LIBTORCH ..
$ cmake --build . --config Release
$ ./example-app
running conv
done running conv

I’ve tested it using libtorch 1.5.0 with CUDA 10.2.

Maybe your CUDA installation is too old for your GPU or incompatible with the driver you are using?
Have you used the local CUDA installation before or did you use the PyTorch binaries, which ship with CUDA?


Thanks for testing it! I verified that CUDA was working by compiling and running a hello world example, so it wasn’t that. I had installed cuDNN separately from PyTorch.

Replacing:
target_link_libraries(example-app "${TORCH_LIBRARIES}")
with:
target_link_libraries(example-app torch)
magically fixed it for me. My ${TORCH_LIBRARIES} contained some local paths to CUDA libraries, so perhaps they were conflicting with the CUDA runtime that ships with PyTorch.
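
For anyone hitting a similar mix-up: one way to see which CUDA libraries CMake is about to link is to print the resolved variable before linking. A minimal sketch of a diagnostic CMakeLists.txt (the message() call is standard CMake; whether stale local CUDA paths show up in the output depends on your installation):

# CMakeLists.txt (diagnostic sketch)
cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
project(example-app)

find_package(Torch REQUIRED)
# Print the full library list; any /usr/local/cuda/... entries here would
# point at a locally installed CUDA runtime rather than the one bundled
# with libtorch.
message(STATUS "TORCH_LIBRARIES = ${TORCH_LIBRARIES}")

add_executable(example-app main.cpp)
# Linking the imported `torch` target (rather than expanding the variable)
# lets CMake resolve libtorch's own bundled CUDA dependencies.
target_link_libraries(example-app torch)
set_property(TARGET example-app PROPERTY CXX_STANDARD 14)

The message() output appears during the configure step (cmake ..), so a conflicting libcudart path can be spotted before building.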

Thanks!