Cannot move tensor to CPU in the main function

justanhduc · January 10, 2021, 11:25am

Hi. When I process an image larger than a certain threshold (512, 512), moving the tensor to CPU in the main function throws this error:

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: invalid argument
Exception raised from copy_kernel_cuda at /pytorch/aten/src/ATen/native/cuda/Copy.cu:200 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7ffbbba9cb89 in /home/justanhduc/Documents/libtorch/lib/libc10.so)
frame #1: <unknown function> + 0x2b1aff3 (0x7ffb52e3fff3 in /home/justanhduc/Documents/libtorch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd42f83 (0x7ffbaa131f83 in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0xd3ffa1 (0x7ffbaa12efa1 in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #4: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x53 (0x7ffbaa130d93 in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #5: at::Tensor& c10::Dispatcher::callWithDispatchKey<at::Tensor&, at::Tensor&, at::Tensor const&, bool>(c10::TypedOperatorHandle<at::Tensor& (at::Tensor&, at::Tensor const&, bool)> const&, c10::DispatchKey, at::Tensor&, at::Tensor const&, bool) const + 0x1e7 (0x7ffbaa9212d7 in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #6: at::Tensor::copy_(at::Tensor const&, bool) const + 0xcd (0x7ffbaaa48ded in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x32cf9d8 (0x7ffbac6be9d8 in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #8: at::Tensor& c10::Dispatcher::callWithDispatchKey<at::Tensor&, at::Tensor&, at::Tensor const&, bool>(c10::TypedOperatorHandle<at::Tensor& (at::Tensor&, at::Tensor const&, bool)> const&, c10::DispatchKey, at::Tensor&, at::Tensor const&, bool) const + 0x1e7 (0x7ffbaa9212d7 in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #9: at::Tensor::copy_(at::Tensor const&, bool) const + 0xcd (0x7ffbaaa48ded in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #10: at::native::to(at::Tensor const&, c10::TensorOptions const&, bool, bool, c10::optional<c10::MemoryFormat>) + 0x17ef (0x7ffbaa3d98ff in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x158d5e6 (0x7ffbaa97c5e6 in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x15eb7b3 (0x7ffbaa9da7b3 in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0xaf2d7f (0x7ffba9ee1d7f in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x133045a (0x7ffbaa71f45a in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #15: at::Tensor::to(c10::TensorOptions const&, bool, bool, c10::optional<c10::MemoryFormat>) const + 0x315 (0x7ffbaaa5d045 in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #16: main() [0x4183ed]
frame #17: __libc_start_main + 0xf0 (0x7ffb4f2cf840 in /lib/x86_64-linux-gnu/libc.so.6)
frame #18: main() [0x416e89]


Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

However, when I move it to CPU before returning to the main function, it works fine.
I will try to reproduce it with a minimal example, as the whole project is kinda large, but please let me know if you have experienced this before and, if possible, how to solve it.
Thanks in advance!

glaringlee · January 11, 2021, 2:53am

From the error message, it seems to me that your tensor (image) got destroyed after returning to the main function; the reason is that the address is unknown when the copy is done. I am not 100% sure, so please double-check. Also, please provide an example if possible so I can look into it further.

justanhduc · January 12, 2021, 5:58am

Thanks for the hint. I figured out the problem. I assigned input.data() to a pointer and then, being too careful, I unconsciously freed this pointer in the destructor of the input class. I am not sure why it affects the output though, but after commenting out this part, it works!
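
For anyone landing here with the same symptom, below is a minimal sketch of the ownership rule involved. It is not the original project code: `InputView` and `scale_and_sum` are hypothetical names, and it assumes the input is a `std::vector<float>`-like container whose `data()` returns a pointer the container still owns. Freeing such a borrowed pointer is undefined behavior, so it can surface later as an unrelated-looking failure rather than at the point of the free.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the "input class" from the post: it only borrows
// the pointer returned by data(), so it must never free it.
class InputView {
public:
    explicit InputView(std::vector<float>& buffer)
        : ptr_(buffer.data()), size_(buffer.size()) {
        assert(ptr_ != nullptr || size_ == 0);
    }

    // Deliberately no delete/free here: the std::vector still owns the memory
    // and releases it itself. Freeing ptr_ would be a double free.
    ~InputView() = default;

    float* data() const { return ptr_; }
    std::size_t size() const { return size_; }

private:
    float* ptr_;        // non-owning pointer into the vector's storage
    std::size_t size_;  // number of elements behind ptr_
};

// Scales the buffer in place through a temporary view, then sums it.
// The buffer remains valid after the view has been destroyed.
float scale_and_sum(std::vector<float>& buffer, float factor) {
    {
        InputView view(buffer);
        for (std::size_t i = 0; i < view.size(); ++i) {
            view.data()[i] *= factor;
        }
    }  // view is destroyed here; the vector's memory is untouched

    float total = 0.0f;
    for (float v : buffer) {
        total += v;
    }
    return total;
}
```

The design choice is simply that only the object that allocated the memory releases it; a class that merely received a pointer from `data()` leaves its destructor empty.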