Hello everyone,
I am working on a C++ project using LibTorch 2.6.0+cu124. The application runs on Linux, Windows, and macOS, but I am currently facing a critical memory issue specifically on Windows (CUDA).
The Constraint: I am deploying to a device with a shared GPU (4GB Total VRAM). My application needs to respect a strict memory budget because other services share the GPU.
The Architecture:

- Models: 2 detection models (YOLO-based) and 5 segmentation models (running as 5 folds). All are TorchScript.
- Execution: strictly sequential (single thread).
- Input: constant image size (no variation in resolution).
- Data type: all models and inputs are converted to FP16.
The Problem: My baseline memory usage is excellent (~300 MiB). However, every 5-6 frames I consistently see a massive spike to ~1300 MiB. This 4x jump causes Out-Of-Memory errors in the shared environment.
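For reference, this is roughly how I track the usage per frame, via the caching allocator's stats (a minimal sketch; I'm assuming device 0 and a simple frame counter):

```cpp
#include <c10/cuda/CUDACachingAllocator.h>
#include <iostream>

// Log current/peak allocator usage once per frame to pinpoint which frame spikes.
void log_cuda_memory(int frame) {
  using namespace c10::cuda::CUDACachingAllocator;
  const auto stats = getDeviceStats(/*device=*/0);
  const auto agg = static_cast<size_t>(StatType::AGGREGATE);
  std::cout << "frame " << frame
            << " allocated=" << stats.allocated_bytes[agg].current / (1024 * 1024) << " MiB"
            << " reserved="  << stats.reserved_bytes[agg].current  / (1024 * 1024) << " MiB"
            << " peak="      << stats.allocated_bytes[agg].peak    / (1024 * 1024) << " MiB\n";
}
```

The `reserved` vs. `allocated` gap is what tells me the spike is coming from the allocator caching large segments rather than from live tensors.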
What I have tried so far (without success):
- Strict scoping: I use strict `{ }` scopes with `torch::InferenceMode` to ensure tensors are destroyed immediately after every model forward pass.
- Disabling JIT: I have disabled the graph executor to prevent on-the-fly recompilation:

  ```cpp
  torch::jit::getProfilingMode() = false;
  torch::jit::setGraphExecutorOptimize(false);
  ```

- Allocator tuning: I have tried various `PYTORCH_CUDA_ALLOC_CONF` settings (including `garbage_collection_threshold:0.6` and `expandable_segments:True`).
- CuDNN limits: I have set `CUDNN_CONV_WSC_PREF=1` via `_putenv` to force minimal-workspace algorithms.
- Checks: I verified that no `std::vector` is accumulating GPU tensors across frames.
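For completeness, the per-frame structure (the scoping plus the JIT flags from the list above) looks roughly like this; model and tensor names are placeholders:

```cpp
#include <torch/script.h>

// One forward pass; the output tensor is scoped so its GPU memory
// is released back to the caching allocator immediately after use.
void run_frame(torch::jit::script::Module& model, const torch::Tensor& input_fp16) {
  torch::InferenceMode guard;  // no autograd bookkeeping
  {
    torch::Tensor out = model.forward({input_fp16}).toTensor();
    // ... consume `out` (copy results to host, etc.) ...
  }  // `out` destroyed here
}

int main() {
  // Disable the profiling executor / graph re-optimization, as described above.
  torch::jit::getProfilingMode() = false;
  torch::jit::setGraphExecutorOptimize(false);
  // ... load the 7 TorchScript models in FP16, then loop over frames
  //     calling run_frame(...) strictly sequentially ...
}
```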
Despite disabling the JIT optimizer and ensuring constant input sizes, the periodic ~1 GB spike persists.
Has anyone experienced periodic allocator spikes like this in C++? Is there a way to strictly cap the CuDNN workspace or debug what specifically is being allocated during these spikes?
Any help would be appreciated.
Thanks,
Dragan Petrovic