Hi. When I process an image larger than some threshold (512, 512), moving the tensor to the CPU in the main function throws this error:
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: invalid argument
Exception raised from copy_kernel_cuda at /pytorch/aten/src/ATen/native/cuda/Copy.cu:200 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7ffbbba9cb89 in /home/justanhduc/Documents/libtorch/lib/libc10.so)
frame #1: <unknown function> + 0x2b1aff3 (0x7ffb52e3fff3 in /home/justanhduc/Documents/libtorch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd42f83 (0x7ffbaa131f83 in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0xd3ffa1 (0x7ffbaa12efa1 in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #4: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x53 (0x7ffbaa130d93 in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #5: at::Tensor& c10::Dispatcher::callWithDispatchKey<at::Tensor&, at::Tensor&, at::Tensor const&, bool>(c10::TypedOperatorHandle<at::Tensor& (at::Tensor&, at::Tensor const&, bool)> const&, c10::DispatchKey, at::Tensor&, at::Tensor const&, bool) const + 0x1e7 (0x7ffbaa9212d7 in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #6: at::Tensor::copy_(at::Tensor const&, bool) const + 0xcd (0x7ffbaaa48ded in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x32cf9d8 (0x7ffbac6be9d8 in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #8: at::Tensor& c10::Dispatcher::callWithDispatchKey<at::Tensor&, at::Tensor&, at::Tensor const&, bool>(c10::TypedOperatorHandle<at::Tensor& (at::Tensor&, at::Tensor const&, bool)> const&, c10::DispatchKey, at::Tensor&, at::Tensor const&, bool) const + 0x1e7 (0x7ffbaa9212d7 in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #9: at::Tensor::copy_(at::Tensor const&, bool) const + 0xcd (0x7ffbaaa48ded in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #10: at::native::to(at::Tensor const&, c10::TensorOptions const&, bool, bool, c10::optional<c10::MemoryFormat>) + 0x17ef (0x7ffbaa3d98ff in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x158d5e6 (0x7ffbaa97c5e6 in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x15eb7b3 (0x7ffbaa9da7b3 in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0xaf2d7f (0x7ffba9ee1d7f in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x133045a (0x7ffbaa71f45a in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #15: at::Tensor::to(c10::TensorOptions const&, bool, bool, c10::optional<c10::MemoryFormat>) const + 0x315 (0x7ffbaaa5d045 in /home/justanhduc/Documents/libtorch/lib/libtorch_cpu.so)
frame #16: main() [0x4183ed]
frame #17: __libc_start_main + 0xf0 (0x7ffb4f2cf840 in /lib/x86_64-linux-gnu/libc.so.6)
frame #18: main() [0x416e89]
Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
However, when I move it to CPU before returning to the main function, it works fine.
I will try to reproduce it with a minimal example, as the whole project is kinda large, but please let me know if you have experienced this before and, if possible, whether there is any solution to this problem.
Thanks in advance!