Hello everyone,
I am working on a C++ project using LibTorch 2.6.0+cu124. The application runs on Linux, Windows, and macOS, but I am currently facing a critical memory issue specifically on Windows (CUDA).
The Constraint: I am deploying to a device with a shared GPU (4GB Total VRAM). My application needs to respect a strict memory budget because other services share the GPU.
The Architecture:

- Models: 2 detection models (YOLO-based) and 5 segmentation models (running as 5 folds). All are TorchScript.
- Execution: strictly sequential (single thread).
- Input: constant image size (no variation in resolution).
- Data type: all models and inputs are converted to FP16.
The Problem: My baseline memory usage is excellent (~300 MiB). However, every 5-6 frames I consistently see a massive spike to ~1300 MiB. This 4x jump causes Out-Of-Memory errors in the shared environment.
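For reference, this is roughly how I track the usage per frame, via the caching allocator's stats (a minimal sketch; I'm assuming device 0 and a simple frame counter):

```cpp
#include <c10/cuda/CUDACachingAllocator.h>
#include <iostream>

// Log current/peak allocator usage once per frame to pinpoint which frame spikes.
void log_cuda_memory(int frame) {
  using namespace c10::cuda::CUDACachingAllocator;
  const auto stats = getDeviceStats(/*device=*/0);
  const auto agg = static_cast<size_t>(StatType::AGGREGATE);
  std::cout << "frame " << frame
            << " allocated=" << stats.allocated_bytes[agg].current / (1024 * 1024) << " MiB"
            << " reserved="  << stats.reserved_bytes[agg].current  / (1024 * 1024) << " MiB"
            << " peak="      << stats.allocated_bytes[agg].peak    / (1024 * 1024) << " MiB\n";
}
```

The `reserved` vs. `allocated` gap is what tells me the spike is coming from the allocator caching large segments rather than from live tensors.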
What I have tried so far (without success):
- Strict scoping: I use strict `{ }` scopes with `torch::InferenceMode` to ensure tensors are destroyed immediately after every model forward pass.
- Disabling JIT: I have disabled the graph executor to prevent on-the-fly recompilation:

  ```cpp
  torch::jit::getProfilingMode() = false;
  torch::jit::setGraphExecutorOptimize(false);
  ```

- Allocator tuning: I have tried various `PYTORCH_CUDA_ALLOC_CONF` settings (including `garbage_collection_threshold:0.6` and `expandable_segments:True`).
- CuDNN limits: I have set `CUDNN_CONV_WSC_PREF=1` via `_putenv` to force minimal-workspace algorithms.
- Checks: I verified that no `std::vector` is accumulating GPU tensors across frames.
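For completeness, the per-frame structure (the scoping plus the JIT flags from the list above) looks roughly like this; model and tensor names are placeholders:

```cpp
#include <torch/script.h>

// One forward pass; the output tensor is scoped so its GPU memory
// is released back to the caching allocator immediately after use.
void run_frame(torch::jit::script::Module& model, const torch::Tensor& input_fp16) {
  torch::InferenceMode guard;  // no autograd bookkeeping
  {
    torch::Tensor out = model.forward({input_fp16}).toTensor();
    // ... consume `out` (copy results to host, etc.) ...
  }  // `out` destroyed here
}

int main() {
  // Disable the profiling executor / graph re-optimization, as described above.
  torch::jit::getProfilingMode() = false;
  torch::jit::setGraphExecutorOptimize(false);
  // ... load the 7 TorchScript models in FP16, then loop over frames
  //     calling run_frame(...) strictly sequentially ...
}
```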
Despite disabling the JIT optimizer and ensuring constant input sizes, the periodic ~1 GB spike persists.
Has anyone experienced periodic allocator spikes like this in C++? Is there a way to strictly cap the CuDNN workspace or debug what specifically is being allocated during these spikes?
Any help would be appreciated.
Thanks,
Dragan Petrovic