The problem
When PyTorch loads a model, glibc's malloc carves the large host-side allocations out of heap arenas. When you unload with del model + gc.collect() + torch.cuda.empty_cache(), Python releases its references, but glibc keeps the arenas: small residual allocations pin entire chunks, and the allocator can only return free space sitting at the top of the heap. Memory grows with every model switch and is never handed back to the OS.
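For reference, this is the pattern that leaks. A minimal sketch, assuming diffusers; the checkpoint names are placeholders:

import gc
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("checkpoint-a", torch_dtype=torch.float16).to("cuda")
# ... run inference ...

# The textbook unload:
del pipe                   # drop the last Python reference
gc.collect()               # collect any cycles still holding tensors
torch.cuda.empty_cache()   # hand cached VRAM back to the driver

# VRAM is released, but host RSS stays high: glibc keeps the arena
# pages that backed the checkpoint's CPU-side buffers.
pipe = DiffusionPipeline.from_pretrained("checkpoint-b", torch_dtype=torch.float16).to("cuda")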
This affects anyone running long-lived inference servers, Gradio apps, ComfyUI, or any pipeline that loads/unloads multiple models.
The fix
export MALLOC_MMAP_THRESHOLD_=65536
export MALLOC_TRIM_THRESHOLD_=65536
Set these before launching Python; glibc reads them once, when the allocator initializes. The first forces every allocation of 64 KB or more to be served by mmap() instead of the arenas, and mmap'd pages go straight back to the OS on free, so nothing is left behind to fragment. The second makes glibc trim the heap whenever more than 64 KB sits free at its top.
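If you can't control the launch command, one workaround is to have the script re-exec itself with the variables set before torch is imported. A sketch; the guard itself is my illustration, not part of the fix:

import os
import sys

# glibc reads these tunables once, when the allocator initializes,
# so os.environ assignments made after startup have no effect.
REQUIRED = {
    "MALLOC_MMAP_THRESHOLD_": "65536",
    "MALLOC_TRIM_THRESHOLD_": "65536",
}

if any(os.environ.get(k) != v for k, v in REQUIRED.items()):
    # Restart the interpreter with the thresholds in place.
    os.execve(sys.executable, [sys.executable] + sys.argv, {**os.environ, **REQUIRED})

import torch  # safe to import now; malloc is tuned from the first allocation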
Proof
- Before: RSS grew ~3GB per model switch, hitting OOM after 17 hours and 107 switches
- After: RSS stayed flat at 955MB across 107 consecutive switches between 13 different checkpoints (including SDXL, Flux, PixArt, SD 1.5, and Playground v2.5)
- Tested with diffusers/FastAPI on an AMD RX 7800 XT (ROCm) and an NVIDIA GTX 1080 Ti (CUDA); one way to log RSS per switch is sketched below
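A minimal sketch of that kind of per-switch RSS logging (Linux only; the loop is illustrative, not the exact harness from the write-up):

def rss_mib() -> float:
    # Resident set size of this process, read from /proc (Linux only).
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024  # value is in kB
    raise RuntimeError("VmRSS not found")

# In the serving loop, log RSS after every unload:
# pipe = load(ckpt); run(pipe); unload(pipe)
# print(f"switch {i}: RSS = {rss_mib():.0f} MiB")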
No code changes. No hook removal. No gc hacks. Just two environment variables.
Full write-up with methodology and data: https://github.com/brjen/pytorch-memory-fix