Hi there,
I have successfully converted a fairly complex PyTorch (Python) model to C++ with libtorch, and it wasn't easy.
The inputs to the model are 2 grayscale image tensors.
The outputs are 3 tensors.
I was able to run inference in C++ and get the same results as the PyTorch (Python) inference.
BUT running inference on several images in a row causes a CUDA out-of-memory error:
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 5.78 GiB total capacity; 3.54 GiB already allocated; 21.62 MiB free; 3.81 GiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
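From the docs, PYTORCH_CUDA_ALLOC_CONF is an environment variable that has to be set before the process first touches CUDA. This is how I understand it would be set on Linux (a minimal sketch; the value 128 is only an example, not a tuned value):

#include <cstdlib>

// Must run before the first CUDA allocation, i.e. before
// torch::jit::load(..., torch::kCUDA) or any .to(torch::kCUDA).
// The value is in MiB; 128 is only an example.
setenv("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128", /*overwrite=*/1);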
The inference code looks like this:
torch::jit::script::Module model = torch::jit::load("PATH TO MODEL", torch::kCUDA);
torch::jit::getProfilingMode() = false;

for (int i = 0; i < 100; ++i)
{
    // The image paths are different on every loop iteration!
    cv::Mat img0_raw = cv::imread("PATH TO IMAGE", cv::IMREAD_GRAYSCALE);
    cv::Mat img1_raw = cv::imread("PATH TO IMAGE", cv::IMREAD_GRAYSCALE);
    cv::resize(img0_raw, img0_raw, cv::Size(480, 480), 0, 0, cv::INTER_LINEAR);
    cv::resize(img1_raw, img1_raw, cv::Size(480, 480), 0, 0, cv::INTER_LINEAR);

    // Convert to float in [0, 1].
    cv::Mat img0Infer_Raw, img1Infer_Raw;
    img0_raw.convertTo(img0Infer_Raw, CV_32FC1, 1 / 255.0);
    img1_raw.convertTo(img1Infer_Raw, CV_32FC1, 1 / 255.0);

    // Wrap the cv::Mat data (no copy), add a batch dimension,
    // then copy to the GPU.
    torch::Tensor img0Tensor =
        torch::from_blob(img0Infer_Raw.data, {1, 480, 480}, at::kFloat)
            .unsqueeze(0).to(torch::kCUDA);
    torch::Tensor img1Tensor =
        torch::from_blob(img1Infer_Raw.data, {1, 480, 480}, at::kFloat)
            .unsqueeze(0).to(torch::kCUDA);

    std::vector<torch::jit::IValue> inputs{img0Tensor, img1Tensor};
    auto outputs = model.forward(inputs).toTuple();

    torch::Tensor mconf   = outputs->elements()[0].toTensor();
    torch::Tensor kpts0_c = outputs->elements()[1].toTensor();
    torch::Tensor kpts1_c = outputs->elements()[2].toTensor();

    // Copy the results back to the CPU.
    torch::Tensor cmconf   = mconf.to(torch::kCPU);
    torch::Tensor ckpts0_c = kpts0_c.to(torch::kCPU);
    torch::Tensor ckpts1_c = kpts1_c.to(torch::kCPU);
}
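To confirm that allocated memory really grows across iterations, this is how I would log the caching allocator stats at the end of each iteration (a sketch based on my reading of c10/cuda/CUDACachingAllocator.h; index 0 of the stat arrays should be the aggregate across pool types, and field names may differ between libtorch versions):

#include <c10/cuda/CUDACachingAllocator.h>
#include <iostream>

// At the end of each loop iteration:
auto stats = c10::cuda::CUDACachingAllocator::getDeviceStats(/*device=*/0);
std::cout << "iter " << i
          << " allocated=" << stats.allocated_bytes[0].current
          << " reserved="  << stats.reserved_bytes[0].current << std::endl;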
Some observations:
- 3.54 GiB is already allocated by PyTorch
- 3.81 GiB is reserved by PyTorch
- if the images are the same for every iteration, then it doesn't crash!
- I've tried calling c10::cuda::CUDACachingAllocator::emptyCache(), but it didn't solve the problem (see the sketch after this list)
I suspect that the PyTorch cache or the reserved memory pool is too small (or fragmented, as the error message suggests).
Any help would be appreciated.
Thanks in advance
Omer