I get errors both when I run on CPU (with the .to(at::kCUDA) calls commented out) and on GPU (with them enabled). In both cases, the errors happen after the convolution completes. This is the minimal example:
#include <ATen/Functions.h>
#include <iostream>
#include <cmath>

void debug_memory_issue() {
    int FHW = 3;      // filter height/width
    int padding = 1;
    auto N = 1;       // batch size
    auto C = 1;       // channels (input and output)
    auto HW = 16;     // image height/width

    at::Tensor imgs = at::rand({N, C, HW, HW}); //.to(at::kCUDA);
    at::Tensor fils = at::rand({C, C, FHW, FHW}); //.to(at::kCUDA);

    std::cout << "running conv" << std::endl;
    at::conv2d(imgs, fils, /* bias */ {}, /* stride */ 1, padding);
    std::cout << "done running conv" << std::endl;
}

int main() {
    debug_memory_issue();
}
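As a sanity check on the shapes: with HW = 16, FHW = 3, stride 1 and padding 1, the usual convolution output-size formula gives a 16x16 output, so I don't think the arguments to conv2d are invalid. Here is a small standalone sketch of that arithmetic. It assumes dilation 1, and conv_out_size is a made-up helper for illustration, not part of ATen:

```cpp
#include <cassert>

// Hypothetical helper (not part of ATen): output length along one spatial
// dimension of a convolution, assuming dilation 1.
int conv_out_size(int in, int kernel, int stride, int padding) {
    assert(stride > 0);
    assert(in + 2 * padding >= kernel);  // kernel must fit in the padded input
    // Integer division floors here because the numerator is non-negative.
    return (in + 2 * padding - kernel) / stride + 1;
}
```

For the example above, conv_out_size(16, 3, 1, 1) evaluates to (16 + 2 - 3) / 1 + 1 = 16.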
On CPU I get a segfault, and the backtrace from lldb is:
(lldb) bt
* thread #1, name = 'example-app', stop reason = signal SIGSEGV: invalid address (fault address: 0x8aee6b0)
* frame #0: 0x00007ffff4d88100 libcudart.so.9.2`___lldb_unnamed_symbol512$$libcudart.so.9.2 + 1296
frame #1: 0x00007ffff4d88256 libcudart.so.9.2`___lldb_unnamed_symbol513$$libcudart.so.9.2 + 54
frame #2: 0x00007fffb61c605a libc.so.6`__cxa_finalize + 154
frame #3: 0x00007fffb7990673 libtorch_cuda.so`__do_global_dtors_aux + 35
frame #4: 0x00007ffff7deb07a ld-2.17.so`_dl_fini + 506
frame #5: 0x00007fffb61c5ce9 libc.so.6`__run_exit_handlers + 217
frame #6: 0x00007fffb61c5d37 libc.so.6`exit + 23
frame #7: 0x00007fffb61ae55c libc.so.6`__libc_start_main + 252
frame #8: 0x0000000000417f39 example-app`_start + 41
(lldb)
On GPU I get this error:
*** Error in `./example-app': double free or corruption (!prev): 0x00000000017f5610 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81299)[0x7f96d02ec299]
/usr/local/cuda/lib64/libcudart.so.9.2(+0x1dde7)[0x7f970ee66de7]
/usr/local/cuda/lib64/libcudart.so.9.2(+0x1e256)[0x7f970ee67256]
/lib64/libc.so.6(__cxa_finalize+0x9a)[0x7f96d02a505a]
/home/path/to/libtorch/lib/libtorch_cuda.so(+0xf19673)[0x7f96d1a6f673]
======= Memory map: ========
00400000-00438000 r-xp 00000000 00:28 266742313 /data/users/me/path/to/executable
[...]
and lldb gives me this backtrace:
(lldb) bt
* thread #1, name = 'example-app', stop reason = signal SIGABRT
* frame #0: 0x00007fffb61c2387 libc.so.6`raise + 55
frame #1: 0x00007fffb61c3a78 libc.so.6`abort + 328
frame #2: 0x00007fffb6204ed7 libc.so.6`__libc_message + 983
frame #3: 0x00007fffb620d299 libc.so.6`_int_free + 1305
frame #4: 0x00007ffff4d87de7 libcudart.so.9.2`___lldb_unnamed_symbol512$$libcudart.so.9.2 + 503
frame #5: 0x00007ffff4d88256 libcudart.so.9.2`___lldb_unnamed_symbol513$$libcudart.so.9.2 + 54
frame #6: 0x00007fffb61c605a libc.so.6`__cxa_finalize + 154
frame #7: 0x00007fffb7990673 libtorch_cuda.so`__do_global_dtors_aux + 35
frame #8: 0x00007ffff7deb07a ld-2.17.so`_dl_fini + 506
frame #9: 0x00007fffb61c5ce9 libc.so.6`__run_exit_handlers + 217
frame #10: 0x00007fffb61c5d37 libc.so.6`exit + 23
frame #11: 0x00007fffb61ae55c libc.so.6`__libc_start_main + 252
frame #12: 0x0000000000419bd9 example-app`_start + 41
(lldb)
I’ve tried using valgrind, but it reports so many memory errors that I can’t pinpoint what I’m doing wrong.
The closest thing I’ve found online is this issue, but I believe I’m using a supported version of CUDA (9.2.88) and the right version of cuDNN (7.6.5).
What am I doing wrong here?