Crash when changing the input dimensions

v_kubicki_github · February 8, 2024, 3:41pm

I have a class Weight that stores a torch::nn::AnyModule. The module's weights are loaded from disk in the Weight constructor.

There is a class Data in which I store custom (non-torch) input and output tensors, which are converted to and from torch::Tensor for inference.

Those custom tensors are not resizable, so to run inference on inputs with different dimensions I must create a separate Data instance for each shape.

The Data instance is held in a std::unique_ptr, which is reset when the dimensions change:

if (!data || data->get_params() != params)
{
	data = std::make_unique<typename Inference_Backend::Data>(params, weights);
}
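To make the lifetime behaviour of this reset explicit, here is a minimal standalone sketch with no Torch dependency. `Params`, `Data` and `ensure_data` are hypothetical stand-ins for my real types (the real Data also receives the weights), so take it as an illustration of the pattern rather than my actual code. The point it shows: assigning a new instance to the pointer destroys the old Data, so any raw pointer taken from the old buffers is invalid afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for the real parameter type: only the input shape.
struct Params {
    std::size_t height = 0;
    std::size_t width = 0;
    bool operator==(const Params& o) const {
        return height == o.height && width == o.width;
    }
    bool operator!=(const Params& o) const { return !(*this == o); }
};

// Hypothetical stand-in for Inference_Backend::Data: a fixed-size buffer
// whose size is decided once, at construction.
class Data {
public:
    explicit Data(const Params& p)
        : params_(p), input_(p.height * p.width, 0.0f) {}

    const Params& get_params() const { return params_; }
    float* input() { return input_.data(); }
    std::size_t input_size() const { return input_.size(); }

private:
    Params params_;
    std::vector<float> input_;
};

// Rebuilds `data` only when the requested shape differs from the current one.
// Returns true when a new instance was created. After a rebuild, the previous
// Data is destroyed and pointers obtained from its input() are dangling.
bool ensure_data(std::unique_ptr<Data>& data, const Params& params) {
    if (data && data->get_params() == params) {
        return false;  // same shape: keep the existing buffers
    }
    data = std::make_unique<Data>(params);
    assert(data->input_size() == params.height * params.width);
    return true;
}
```

Calling `ensure_data` twice with the same shape keeps the existing instance, while a call with a new shape replaces it. Anything that still refers to the old buffers at that point (for example a tensor view built over the old memory) would have to be rebuilt as well.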

When I run an inference, one of the custom tensors in Data is set to the input value, it is then converted to a Torch tensor, and the forward method of the AnyModule in Weight is called. With a dataset of variable image sizes, the process starts correctly and the first inputs are processed one at a time with a batch size of 1. However, as soon as the input dimensions change, I get a runtime error (see the log at the end).

The error seems related to a CUDA copy kernel. I have set up everything (model and tensors) to be located on the first CUDA device. What I do not understand is why the error comes from Torch: creating a new Data instance does not modify the existing AnyModule, since it lives in Weight.

Could you help me? Regards.

Vincent

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f9daf25909b in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xbf (0x7f9daf253c4f in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x58f (0x7f9da965f05f in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libc10_cuda.so)
frame #3: void at::native::gpu_kernel_impl<__nv_hdl_wrapper_t<false, true, false, __nv_dl_tag<void (*)(at::TensorIteratorBase&), &at::native::direct_copy_kernel_cuda, 10u>, float (float)> >(at::TensorIteratorBase&, __nv_hdl_wrapper_t<false, true, false, __nv_dl_tag<void (*)(at::TensorIteratorBase&), &at::native::direct_copy_kernel_cuda, 10u>, float (float)> const&) + 0x8f6 (0x7f9d42bfe196 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cuda.so)
frame #4: void at::native::gpu_kernel<__nv_hdl_wrapper_t<false, true, false, __nv_dl_tag<void (*)(at::TensorIteratorBase&), &at::native::direct_copy_kernel_cuda, 10u>, float (float)> >(at::TensorIteratorBase&, __nv_hdl_wrapper_t<false, true, false, __nv_dl_tag<void (*)(at::TensorIteratorBase&), &at::native::direct_copy_kernel_cuda, 10u>, float (float)> const&) + 0x143 (0x7f9d42bfe933 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cuda.so)
frame #5: at::native::direct_copy_kernel_cuda(at::TensorIteratorBase&) + 0x280 (0x7f9d42be29a0 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cuda.so)
frame #6: at::native::copy_device_to_device(at::TensorIterator&, bool, bool) + 0x705 (0x7f9d42be33c5 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x18553c0 (0x7f9d42be43c0 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0x15d95bd (0x7f9d93f9b5bd in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #9: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x75 (0x7f9d93f9d645 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #10: at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool) + 0x165 (0x7f9d94ca4bf5 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #11: at::native::clone(at::Tensor const&, c10::optional<c10::MemoryFormat>) + 0x1de (0x7f9d942d235e in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x268ba33 (0x7f9d9504da33 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #13: at::_ops::clone::call(at::Tensor const&, c10::optional<c10::MemoryFormat>) + 0x145 (0x7f9d949c2405 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #14: at::native::contiguous(at::Tensor const&, c10::MemoryFormat) + 0x71 (0x7f9d942da5f1 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x286afb3 (0x7f9d9522cfb3 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #16: at::_ops::contiguous::call(at::Tensor const&, c10::MemoryFormat) + 0x14b (0x7f9d94dcc2eb in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #17: at::TensorBase::__dispatch_contiguous(c10::MemoryFormat) const + 0x2f (0x7f9d93dcf72f in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0x113e25b (0x7f9d424cd25b in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cuda.so)
frame #19: at::native::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) + 0xa5 (0x7f9d424cd645 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cuda.so)
frame #20: <unknown function> + 0x2eb9451 (0x7f9d44248451 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cuda.so)
frame #21: <unknown function> + 0x2eb94ef (0x7f9d442484ef in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cuda.so)
frame #22: at::_ops::cudnn_convolution::call(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) + 0x21f (0x7f9d94ca50bf in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #23: at::native::_convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long, bool, bool, bool, bool) + 0x12cd (0x7f9d93f6af7d in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #24: <unknown function> + 0x269059c (0x7f9d9505259c in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #25: <unknown function> + 0x2690664 (0x7f9d95052664 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #26: at::_ops::_convolution::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, bool, c10::ArrayRef<c10::SymInt>, long, bool, bool, bool, bool) + 0x2b5 (0x7f9d9480c9c5 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #27: at::native::convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long) + 0x15f (0x7f9d93f61b6f in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #28: <unknown function> + 0x2690182 (0x7f9d95052182 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #29: <unknown function> + 0x2690202 (0x7f9d95052202 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #30: at::_ops::convolution::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, bool, c10::ArrayRef<c10::SymInt>, long) + 0x227 (0x7f9d947d5857 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #31: <unknown function> + 0x3a795c0 (0x7f9d9643b5c0 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #32: <unknown function> + 0x3a7a246 (0x7f9d9643c246 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #33: at::_ops::convolution::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, bool, c10::ArrayRef<c10::SymInt>, long) + 0x23c (0x7f9d9480be0c in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #34: at::native::conv2d_symint(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, long) + 0x20e (0x7f9d93f64ebe in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #35: <unknown function> + 0x286b1f2 (0x7f9d9522d1f2 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #36: at::_ops::conv2d::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, long) + 0x203 (0x7f9d94dcdc03 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #37: <unknown function> + 0x51bc2c8 (0x7f9d97b7e2c8 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #38: torch::nn::Conv2dImpl::_conv_forward(at::Tensor const&, at::Tensor const&) + 0x4a1 (0x7f9d97b775b1 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #39: torch::nn::Conv2dImpl::forward(at::Tensor const&) + 0x24 (0x7f9d97b776f4 in /home/vincent/rc3.vincent_pytorch/programs/submodules/pytorch_cu121/libtorch/lib/libtorch_cpu.so)
frame #40: crazy_tensor::crazy_torch::CBAImpl::forward(at::Tensor) + 0x2e (0x7f9da95d1eee in /home/vincent/rc3.vincent_pytorch/programs/kifu-snap/compcmake/clang_release/libcrazy_torch.so)
frame #41: torch::nn::AnyValue torch::nn::AnyModuleHolder<crazy_tensor::crazy_torch::CBAImpl, at::Tensor>::InvokeForward::operator()<at::Tensor>(at::Tensor&&) + 0x31 (0x7f9da95d3261 in /home/vincent/rc3.vincent_pytorch/programs/kifu-snap/compcmake/clang_release/libcrazy_torch.so)
frame #42: torch::nn::AnyModuleHolder<crazy_tensor::crazy_torch::CBAImpl, at::Tensor>::forward(std::vector<torch::nn::AnyValue, std::allocator<torch::nn::AnyValue> >&&) + 0x327 (0x7f9da95d26b7 in /home/vincent/rc3.vincent_pytorch/programs/kifu-snap/compcmake/clang_release/libcrazy_torch.so)
frame #43: /home/vincent/rc3.vincent_pytorch/programs/kifu-snap/compcmake/clang_release/torch_stone_infer() [0x47ead8]
frame #44: /home/vincent/rc3.vincent_pytorch/programs/kifu-snap/compcmake/clang_release/torch_stone_infer() [0x47e74f]
frame #45: /home/vincent/rc3.vincent_pytorch/programs/kifu-snap/compcmake/clang_release/torch_stone_infer() [0x47e249]
frame #46: /home/vincent/rc3.vincent_pytorch/programs/kifu-snap/compcmake/clang_release/torch_stone_infer() [0x47d6e3]
frame #47: /home/vincent/rc3.vincent_pytorch/programs/kifu-snap/compcmake/clang_release/torch_stone_infer() [0x442ed1]
frame #48: /home/vincent/rc3.vincent_pytorch/programs/kifu-snap/compcmake/clang_release/torch_stone_infer() [0x434034]
frame #49: /home/vincent/rc3.vincent_pytorch/programs/kifu-snap/compcmake/clang_release/torch_stone_infer() [0x43126f]
frame #50: /home/vincent/rc3.vincent_pytorch/programs/kifu-snap/compcmake/clang_release/torch_stone_infer() [0x430619]
frame #51: __libc_start_main + 0xf3 (0x7f9d40e50083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #52: /home/vincent/rc3.vincent_pytorch/programs/kifu-snap/compcmake/clang_release/torch_stone_infer() [0x42d62e]

Aborted (core dumped)