Hi, I’m working on vllm-project/vllm (a high-throughput and memory-efficient inference and serving engine for LLMs), and the recent release of PyTorch 2.2.0 has caused me some trouble. I came here for help with profiling CUDA memory usage.
The basic story: vLLM tries to allocate as much GPU memory as possible for the KV cache to accelerate LLM inference. To do so, it first profiles memory usage, estimates the maximum amount of memory available for the KV cache, and leaves some headroom for storing activations during inference.
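For context, the profiling step is roughly like the sketch below. This is a simplification: the function name, the `dummy_batch` argument, and the `gpu_memory_utilization` cap are illustrative stand-ins, not the actual vLLM worker code.

```python
import torch

def estimate_kv_cache_budget(model, dummy_batch, gpu_memory_utilization=0.9):
    # Run one forward pass with a worst-case batch, record the peak memory
    # needed for weights + activations, and hand the rest (up to a
    # utilization cap) to the KV cache.
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    with torch.no_grad():
        model(dummy_batch)
    torch.cuda.synchronize()

    peak_weights_and_activations = torch.cuda.max_memory_allocated()
    total_gpu_memory = torch.cuda.get_device_properties(0).total_memory

    # Whatever is left below the utilization cap is assumed to be
    # available for KV cache blocks.
    return int(total_gpu_memory * gpu_memory_utilization) - peak_weights_and_activations
```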
In addition, vLLM uses CUDA graphs to reduce Python overhead.
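The CUDA graph usage follows the standard capture/replay pattern sketched below (`model`, `static_input`, and `next_batch` are placeholders). The detail relevant to this issue is that tensors allocated during capture come from a private memory pool owned by the caching allocator, which stays reserved for the lifetime of the graph.

```python
import torch

# Warm up on a side stream so lazy initialization doesn't end up in the graph.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture: allocations made here land in a private allocator pool.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Replay with new data copied into the static input buffer.
static_input.copy_(next_batch)
g.replay()
```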
When PyTorch is upgraded from 2.1.2 to 2.2.0, there seems to be some internal change in the memory allocator, and the amount of memory that can actually be used decreases. This can cause OOM errors during inference.
Here is the diagnostic data (produced by torch.cuda.memory_stats and torch.cuda.memory._dump_snapshot), collected from a server with 2 L4 GPUs:
PyTorch 2.1.2 works with or without CUDA graphs; PyTorch 2.2.0 works only without CUDA graphs, and OOMs with CUDA graphs enabled.
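For reference, the stats and snapshots were collected roughly like this (`run_profiling_step` is a placeholder for the actual vLLM profiling / graph-capture step):

```python
import torch

# Start recording allocator events before the step being investigated.
torch.cuda.memory._record_memory_history(max_entries=100_000)

run_profiling_step()  # placeholder for the actual vLLM work

# Aggregate counters from the caching allocator.
stats = torch.cuda.memory_stats()
print("peak allocated:", stats["allocated_bytes.all.peak"])
print("peak reserved: ", stats["reserved_bytes.all.peak"])

# Full snapshot, viewable at https://pytorch.org/memory_viz
torch.cuda.memory._dump_snapshot("vllm_memory_snapshot.pickle")
```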
So I have three questions:
- Did anything actually change in CUDA graphs from PyTorch 2.1.2 to 2.2.0 that would explain the change in memory allocation when CUDA graphs are used?
- If so, are there any knobs that can be used to control this behavior?
- At worst, if PyTorch 2.2.0 indeed uses more internal memory or causes more memory fragmentation, what is a reliable way to calculate the maximum amount of memory vLLM can allocate for the KV cache in its use case?
Any help is appreciated!