Higher Peak Memory on First Run with torch.compile and Inductor Backend

I am using torch.compile with the Inductor backend to optimize a LLaMA-2-7B model. I've noticed a consistent pattern: when I run the compiled model end-to-end, peak GPU memory usage is always higher on the first run than on subsequent runs (second, third, and so on).

This happens even though the model and inputs are identical on every run. The memory delta between the first and second run appears to be approximately seq_len × seq_len × 2 bytes, though I haven't yet verified this across multiple sequence lengths.

For example, with seq_len = 8192, the peak drops by approximately 128 MiB between the first and second run (8192 × 8192 × 2 bytes = 134,217,728 bytes = 128 MiB). The behavior is deterministic and reproducible.
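In case it's useful, this is roughly how I'm measuring peak memory (a minimal sketch; `compiled_model` and `batch` are placeholders for my actual compiled model and input batch):

```python
import torch

# placeholders: `compiled_model` is the torch.compile'd model,
# `batch` is the fixed input batch (identical across runs)
peaks = []
for run in range(3):
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        compiled_model(**batch)
    torch.cuda.synchronize()
    peaks.append(torch.cuda.max_memory_allocated())
    print(f"run {run}: peak = {peaks[-1] / 2**20:.1f} MiB")

# with seq_len = 8192 this comes out to ~128 MiB
print(f"first-to-second delta = {(peaks[0] - peaks[1]) / 2**20:.1f} MiB")
```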

Unfortunately, I'm unable to share the codebase, but I'm working with the Hugging Face implementation of LLaMA-2-7B, compiled with torch.compile using the Inductor backend.
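The setup boils down to something like this (a minimal sketch rather than my actual code; the checkpoint name and fp16 dtype are shown for concreteness):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).cuda().eval()
compiled = torch.compile(model, backend="inductor")

tokenizer = AutoTokenizer.from_pretrained(model_id)
# fixed 8192-token input, identical on every run
inputs = tokenizer(
    "hello " * 8192, return_tensors="pt",
    truncation=True, max_length=8192,
).to("cuda")

with torch.no_grad():
    compiled(**inputs)  # first run: higher peak
    compiled(**inputs)  # second run: peak drops by ~128 MiB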

I'm trying to understand what causes this delta. Is it temporary allocations during graph capture, causal mask materialization (a full fp16 mask of shape (seq_len, seq_len) would be exactly seq_len × seq_len × 2 bytes), or workspace buffers that are allocated on the first run and cached/reused afterwards? Any pointers on how to debug this systematically would be greatly appreciated.
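For reference, my plan so far is to capture an allocator snapshot of the first run and look for the extra allocation in the memory visualizer (I realize _record_memory_history is an underscore-prefixed, private API, so details may vary by PyTorch version):

```python
import torch

# record allocation history (private API as of my PyTorch version)
torch.cuda.memory._record_memory_history(max_entries=100_000)

with torch.no_grad():
    compiled(**inputs)  # the first, higher-peak run from the sketch above

torch.cuda.memory._dump_snapshot("first_run.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording

# load first_run.pickle at https://pytorch.org/memory_viz and look for an
# allocation of roughly seq_len * seq_len * 2 bytes that only appears once
```

I'm not sure how well the snapshot attributes Inductor-internal or workspace allocations, though, hence the question.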

Thanks!