Hi, I’m working on vllm-project/vllm (a high-throughput and memory-efficient inference and serving engine for LLMs), and the recent release of PyTorch 2.2.0 has caused me some trouble. I came here for help with profiling CUDA memory usage.
The basic story: vLLM tries to allocate as much GPU memory as possible for the KV cache to accelerate LLM inference. To do so, it first profiles memory usage, estimates the maximum amount of memory available for the KV cache, and leaves some headroom for storing activations during inference.
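For context, the profiling step is roughly like the sketch below. This is a simplification: the function name, the `dummy_batch` argument, and the `gpu_memory_utilization` cap are illustrative stand-ins, not the actual vLLM worker code.

```python
import torch

def estimate_kv_cache_budget(model, dummy_batch, gpu_memory_utilization=0.9):
    # Run one forward pass with a worst-case batch, record the peak memory
    # needed for weights + activations, and hand the rest (up to a
    # utilization cap) to the KV cache.
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    with torch.no_grad():
        model(dummy_batch)
    torch.cuda.synchronize()

    peak_weights_and_activations = torch.cuda.max_memory_allocated()
    total_gpu_memory = torch.cuda.get_device_properties(0).total_memory

    # Whatever is left below the utilization cap is assumed to be
    # available for KV cache blocks.
    return int(total_gpu_memory * gpu_memory_utilization) - peak_weights_and_activations
```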
In addition, vLLM uses CUDA graphs to reduce Python overhead.
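The CUDA graph usage follows the standard capture/replay pattern sketched below (`model`, `static_input`, and `next_batch` are placeholders). The detail relevant to this issue is that tensors allocated during capture come from a private memory pool owned by the caching allocator, which stays reserved for the lifetime of the graph.

```python
import torch

# Warm up on a side stream so lazy initialization doesn't end up in the graph.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture: allocations made here land in a private allocator pool.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Replay with new data copied into the static input buffer.
static_input.copy_(next_batch)
g.replay()
```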
When PyTorch is upgraded from 2.1.2 to 2.2.0, there seems to be some internal change in the memory allocator, and the amount of memory that can actually be used decreases. This can cause OOM errors during inference.
Here is the diagnostic data (produced by torch.cuda.memory_stats and torch.cuda.memory._dump_snapshot), collected from a server with 2 L4 GPUs:
PyTorch 2.1.2 works with or without CUDA graphs; PyTorch 2.2.0 works only without CUDA graphs, and OOMs with CUDA graphs enabled.
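For reference, the stats and snapshots were collected roughly like this (`run_profiling_step` is a placeholder for the actual vLLM profiling / graph-capture step):

```python
import torch

# Start recording allocator events before the step being investigated.
torch.cuda.memory._record_memory_history(max_entries=100_000)

run_profiling_step()  # placeholder for the actual vLLM work

# Aggregate counters from the caching allocator.
stats = torch.cuda.memory_stats()
print("peak allocated:", stats["allocated_bytes.all.peak"])
print("peak reserved: ", stats["reserved_bytes.all.peak"])

# Full snapshot, viewable at https://pytorch.org/memory_viz
torch.cuda.memory._dump_snapshot("vllm_memory_snapshot.pickle")
```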
So I have three questions:
- Did anything actually change in CUDA graphs from PyTorch 2.1.2 to 2.2.0 that would explain the change in memory allocation when CUDA graphs are used?
- If so, are there any knobs that can be used to control this behavior?
- At worst, if PyTorch 2.2.0 indeed uses more internal memory or causes more memory fragmentation, what is a reliable way to calculate the maximum amount of memory vLLM can allocate for the KV cache in its use case?
Any help is appreciated!