Libtorch inference causes CUDA out of memory

Hi there,

I have successfully converted a very complex PyTorch Python model to C++ LibTorch, and it wasn’t easy.

The inputs to the model are two grayscale image tensors.
The outputs are three tensors.

I was able to run inference in C++ and get the same results as the PyTorch inference.
BUT running inference on several images in a row causes a CUDA out-of-memory error:

RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 5.78 GiB total capacity; 3.54 GiB already allocated; 21.62 MiB free; 3.81 GiB reserved in total by PyTorch)
If reserved memory is >> allocated memory
try setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
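
(For reference, the allocator setting mentioned in the error is passed through the PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch of setting it from C++ before the first CUDA allocation; 128 MiB is purely an example value, not a recommendation:)

#include <cstdlib>

int main(int argc, char* argv[])
{
    // Must happen before the first CUDA allocation so the caching allocator
    // reads it at initialization time. The 128 MiB split size is only an example.
    setenv("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128", /*overwrite=*/1);

    // ... load the model and run inference as shown below ...
    return 0;
}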

The inference code looks like this:


torch::jit::script::Module model = torch::jit::load(
                                     "PATH TO MODEL", torch::kCUDA);

torch::jit::getProfilingMode() = false;

for (int i = 0; i < 100; ++i)
{
    // The image paths are different on every loop iteration !
    cv::Mat img0_raw = cv::imread("PATH TO IMAGE", cv::IMREAD_GRAYSCALE);
    cv::Mat img1_raw = cv::imread("PATH TO IMAGE", cv::IMREAD_GRAYSCALE);
    
    cv::resize(img0_raw, img0_raw, cv::Size(480, 480), 0, 0, cv::INTER_LINEAR);
    cv::resize(img1_raw, img1_raw, cv::Size(480, 480), 0, 0, cv::INTER_LINEAR);
    
    cv::Mat img0Infer_Raw, img1Infer_Raw;
    // Convert to float and scale to [0, 1]
    img0_raw.convertTo(img0Infer_Raw, CV_32FC1, 1/255.0);
    img1_raw.convertTo(img1Infer_Raw, CV_32FC1, 1/255.0);
  
    // Wrap the OpenCV buffers as tensors, add a batch dimension and move to the GPU
    torch::Tensor img0Tensor = torch::from_blob(img0Infer_Raw.data, { 1, 480, 480 }, 
                                 at::kFloat).unsqueeze(0).to(torch::kCUDA);
  
    torch::Tensor img1Tensor = torch::from_blob(img1Infer_Raw.data, { 1, 480, 480 }, 
                                 at::kFloat).unsqueeze(0).to(torch::kCUDA);
  
    std::vector<torch::jit::IValue> inputs {img0Tensor, img1Tensor};
    auto outputs = model.forward(inputs).toTuple();
    torch::Tensor mconf = outputs->elements()[0].toTensor();
    torch::Tensor kpts0_c = outputs->elements()[1].toTensor();
    torch::Tensor kpts1_c = outputs->elements()[2].toTensor();
      
    // Copy the outputs back to the CPU
    torch::Tensor cmconf = mconf.to(torch::kCPU);
    torch::Tensor ckpts0_c = kpts0_c.to(torch::kCPU);
    torch::Tensor ckpts1_c = kpts1_c.to(torch::kCPU);
}

Some observations:

  1. 3.54 GiB is already allocated by PyTorch
  2. 3.81 GiB is reserved by PyTorch
  3. If the images are the same for every iteration, it doesn’t crash!
  4. I’ve tried calling c10::cuda::CUDACachingAllocator::emptyCache() (roughly as in the sketch below), but it didn’t solve the problem
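
For reference, the emptyCache call I mean is roughly this, added at the end of each loop iteration (it needs the c10/cuda/CUDACachingAllocator.h header):

#include <c10/cuda/CUDACachingAllocator.h>

// ... at the end of each loop iteration, after the results are back on the CPU ...
c10::cuda::CUDACachingAllocator::emptyCache();  // releases cached, unused blocks back to the driver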

I suspect that the PyTorch cache or the reserved memory is too small.
Any help will be appreciated.
Thanks in advance.

Omer

One thing I would check (I’m not entirely familiar with this use case) is whether some model activations are being inadvertently stored.

Does calling

model.requires_grad_(false);
model.train(false);

before running inference change the behavior at all?
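
Roughly what I have in mind, as an untested sketch (if requires_grad_ isn’t exposed on the scripted module, wrapping the forward call in torch::NoGradGuard should have a similar effect):

// Untested sketch: make sure no autograd state is kept during inference
model.train(false);                              // put the module in eval mode
torch::NoGradGuard no_grad;                      // nothing inside this scope records a graph
auto outputs = model.forward(inputs).toTuple();  // forward pass exactly as in your loop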

Thanks for the idea, but doing this doesn’t change the out-of-memory problem…
And I couldn’t find any requires_grad_(false), so I used torch::requires_grad(false) instead, which made no difference.

Based on the original error message, it seems that a significant portion of GPU memory isn’t used by PyTorch. Is there some other activity on the same GPU while you are running the code?

I would also check whether the original Python implementation has the same behavior, and you could check the memory utilization at each step, e.g. with torch.cuda.memory_summary. I would expect the pattern of allocations/deallocations on the GPU to be basically the same in Python vs. C++.
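
On the C++ side, a crude way to watch the utilization at each step is to use the CUDA runtime API directly (sketch; assumes the CUDA runtime headers are available in your build):

#include <cuda_runtime_api.h>
#include <iostream>

// Sketch: print free/total device memory, e.g. once per loop iteration
void printCudaMemory()
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    std::cout << "CUDA free: " << freeBytes / (1024.0 * 1024.0) << " MiB / "
              << totalBytes / (1024.0 * 1024.0) << " MiB total" << std::endl;
}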

Hi eqy,
Thanks for the input. I’ve noticed that when I lower the resolution of the input images, everything works fine and runs for an hour without any problems.
The Python PyTorch implementation doesn’t exhibit the problem.
It seems that LibTorch is reserving memory differently than PyTorch; it could be that the memory is fragmented, or that something else specific to memory management differs between LibTorch and PyTorch.