Libtorch inference causes CUDA out of memory

Hi there,

I have successfully ported a very complex PyTorch (Python) model to C++ libtorch, and it wasn’t easy.

The inputs to the model are two grayscale image tensors.
The outputs are three tensors.

I was able to run inference in C++ and get the same results as the PyTorch inference,
BUT running inference on several images in a row causes a CUDA out-of-memory error:

RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 5.78 GiB total capacity; 3.54 GiB already allocated; 21.62 MiB free; 3.81 GiB reserved in total by PyTorch)
If reserved memory is >> allocated memory
try setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The inference code looks like this:

torch::jit::script::Module model = torch::jit::load(
                                     "PATH TO MODEL", torch::kCUDA);

torch::jit::getProfilingMode() = false;

for (int i = 0; i < 100; ++i) {
    // The image paths are different on every loop iteration!
    cv::Mat img0_raw = cv::imread("PATH TO IMAGE", cv::IMREAD_GRAYSCALE);
    cv::Mat img1_raw = cv::imread("PATH TO IMAGE", cv::IMREAD_GRAYSCALE);
    cv::resize(img0_raw, img0_raw, cv::Size(480, 480), 0, 0, cv::INTER_LINEAR);
    cv::resize(img1_raw, img1_raw, cv::Size(480, 480), 0, 0, cv::INTER_LINEAR);
    cv::Mat img0Infer_Raw, img1Infer_Raw;
    img0_raw.convertTo(img0Infer_Raw, CV_32FC1, 1/255.0);
    img1_raw.convertTo(img1Infer_Raw, CV_32FC1, 1/255.0);
    torch::Tensor img0Tensor = torch::from_blob(img0Infer_Raw.data, { 1, 480, 480 },
                                                torch::kFloat32).to(torch::kCUDA);
    torch::Tensor img1Tensor = torch::from_blob(img1Infer_Raw.data, { 1, 480, 480 },
                                                torch::kFloat32).to(torch::kCUDA);
    std::vector<torch::jit::IValue> inputs {img0Tensor, img1Tensor};
    auto outputs = model.forward(inputs).toTuple();
    torch::Tensor mconf = outputs->elements()[0].toTensor();
    torch::Tensor kpts0_c = outputs->elements()[1].toTensor();
    torch::Tensor kpts1_c = outputs->elements()[2].toTensor();
    // Post-processing of the outputs (details omitted here), e.g. copies on the CPU:
    torch::Tensor cmconf = mconf.to(torch::kCPU);
    torch::Tensor ckpts0_c = kpts0_c.to(torch::kCPU);
    torch::Tensor ckpts1_c = kpts1_c.to(torch::kCPU);
}

Some observations:

  1. 3.54 GiB is already allocated by PyTorch
  2. 3.81 GiB is reserved by PyTorch
  3. If the images are the same for every iteration, it doesn’t crash!
  4. I’ve tried calling c10::cuda::CUDACachingAllocator::emptyCache() (roughly as in the sketch below), but it didn’t solve the problem
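
For reference, this is roughly where I tried calling it (a simplified sketch, not the exact code):

#include <c10/cuda/CUDACachingAllocator.h>

for (int i = 0; i < 100; ++i) {
    {
        // ... read the images, build the input tensors, run model.forward() as above ...
    }   // per-iteration tensors go out of scope here

    // ask the caching allocator to release the now-unused cached blocks
    c10::cuda::CUDACachingAllocator::emptyCache();
}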

I suspect that the PyTorch cache or the reserved memory is too small.
Any help will be appreciated.
Thanks in advance


One thing I would check (not entirely familiar with this use case) is whether some model activations are being inadvertently stored.

Does calling requires_grad_(false) on the model’s parameters before running inference change the behavior at all?
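
For example, something along these lines (a rough sketch using the model variable from your snippet; the exact placement depends on your code):

// Disable gradient tracking for the whole inference scope, so no autograd
// graph / activations are kept alive between iterations.
torch::NoGradGuard no_grad;

// Optionally also mark the scripted module's parameters as not requiring grad:
for (auto p : model.parameters()) {
    p.set_requires_grad(false);
}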

Thanks for the idea, but doing this doesn’t change the out-of-memory problem…
I couldn’t find any requires_grad_(false) in libtorch, so I used torch::requires_grad(false), which made no difference.

Based on the original error message, it seems that a significant portion of GPU memory isn’t used by PyTorch. Is there some other activity on the same GPU while you are running the code?

I would also check whether the original Python implementation has the same behavior, and you could check the memory utilization at each step, e.g. with torch.cuda.memory_summary (see the PyTorch documentation). I would expect the pattern of allocations/deallocations on the GPU to be basically the same in Python vs. C++.
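
On the libtorch side, a rough equivalent is to query the caching allocator statistics directly. This is only a sketch; the exact namespaces and struct fields have moved around between libtorch versions:

#include <c10/cuda/CUDACachingAllocator.h>
#include <iostream>

// Print the currently allocated vs. reserved CUDA memory for one device.
void logCudaMemory(int device = 0) {
    using namespace c10::cuda::CUDACachingAllocator;
    const auto stats = getDeviceStats(device);
    const auto agg = static_cast<size_t>(StatType::AGGREGATE);
    std::cout << "allocated: "
              << stats.allocated_bytes[agg].current / (1024.0 * 1024.0) << " MiB, reserved: "
              << stats.reserved_bytes[agg].current / (1024.0 * 1024.0) << " MiB" << std::endl;
}

Calling this before and after each forward() should show whether the allocated amount keeps growing (something is being retained) or only the reserved amount does (which would point at fragmentation).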

Hi eqy,
Thanks for the input. I’ve noticed that when I lower the resolution of the input images everything works fine and runs for an hour without any problems.
The Python PyTorch implementation doesn’t exhibit the problem.
It seems that libtorch is reserving memory differently than PyTorch; maybe the memory is getting fragmented, or something else in the memory management differs between libtorch and PyTorch.
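
One thing still left to try is the hint from the error message itself: configuring the caching allocator via PYTORCH_CUDA_ALLOC_CONF before the process makes its first CUDA allocation. A minimal sketch (on Linux; the 128 MiB split size is only an illustrative guess):

#include <cstdlib>

int main() {
    // Must be set before the first CUDA allocation, i.e. before
    // torch::jit::load(..., torch::kCUDA) runs.
    setenv("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128", /*overwrite=*/1);

    // ... load the model and run the inference loop as before ...
    return 0;
}

The same effect can be achieved by exporting the environment variable before launching the binary.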