Why is CUDA running out of memory for Llama 2 inference?

I am using a GPU with 48 GB of memory and Llama 2 7b. The model easily fits into GPU memory, but when I perform inference with a long sequence length of 8k-10k tokens I run out of memory. Does anyone know why?

A sequence length of 10k tokens should only add about 10k × 10k × 4 bytes ≈ 400 MB of memory, since transformer attention memory is O(n^2). Since I am only doing inference, previous activations can be discarded.
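For what it's worth, the 400 MB figure only covers a single attention head. Llama-2-7b has 32 heads, and a naive (non-fused) attention implementation materializes the full score matrix for every head of a layer at once, so the back-of-envelope arithmetic in fp32 looks more like this:

```python
# Back-of-envelope attention-score memory for Llama-2-7b at 10k tokens,
# assuming a naive attention that materializes the full score matrix in fp32.
seq_len = 10_000
num_heads = 32          # Llama-2-7b uses 32 attention heads
bytes_per_float = 4     # fp32

# One head's score matrix: seq_len x seq_len floats -> the 400 MB estimate.
per_head = seq_len * seq_len * bytes_per_float

# A naive implementation holds all heads of a layer at once.
per_layer = per_head * num_heads

print(f"per head:  {per_head / 1e9:.1f} GB")    # 0.4 GB
print(f"per layer: {per_layer / 1e9:.2f} GB")   # 12.80 GB
```

So even if each layer's scores are freed before the next layer runs, the transient spike per layer is ~12.8 GB, not 400 MB.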

My code looks something like this:

import torch
from transformers import LlamaModel

device = torch.device("cuda")

with torch.no_grad():
    llama_name = "meta-llama/Llama-2-7b-chat-hf"
    llama = LlamaModel.from_pretrained(llama_name).to(device)
    print('loaded model')
    for batch_idx, (times, texts) in enumerate(train_loader):  # train_loader defined elsewhere
        texts = texts.to(device)
        # note: LlamaModel returns hidden states; LlamaForCausalLM is needed for logits
        logits = llama(texts)


Do you have any advice please?
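Two mitigations that are commonly suggested for this situation, sketched below under the assumption of a recent transformers version (the function names here are illustrative, not from your code): load in fp16 to halve weight and activation memory, and pass use_cache=False so the model skips building the KV cache, which at 10k tokens is itself roughly 2 × 32 layers × 10k × 4096 × 4 bytes ≈ 10.5 GB in fp32.

```python
import torch
from transformers import LlamaForCausalLM

def load_llama_fp16(name="meta-llama/Llama-2-7b-chat-hf"):
    # fp16 halves weights and activations (~13 GB of weights instead of ~26 GB)
    model = LlamaForCausalLM.from_pretrained(name, torch_dtype=torch.float16)
    return model.eval().to("cuda")

@torch.no_grad()
def forward_logits(model, input_ids):
    # use_cache=False skips the KV cache: 2 * layers * seq * hidden * bytes,
    # which is ~10.5 GB at 10k tokens in fp32 (half that in fp16)
    return model(input_ids, use_cache=False).logits
```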

I don’t believe a memory increase of only 400MB is expected, as all intermediate activations also grow with the sequence length.

I understand that’s probably the case for training, but during inference there’s no need to keep the past activations while the input passes through the layers. So is that optimization not implemented in PyTorch?

Yes, intermediates won’t be kept, but they are still computed and passed to the next layer. The peak memory would thus be smaller than during training, but it would still increase by more than the input size delta.
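You can see this directly by measuring peak allocation around a forward pass. A measurement sketch (report_peak is a hypothetical helper, not a library function):

```python
import torch

def report_peak(model, input_ids):
    """Run one forward pass under no_grad and report peak CUDA memory.

    Compares the peak against the steady-state footprint (mostly weights)
    to show how much the transient intermediates add on top.
    """
    torch.cuda.reset_peak_memory_stats()
    base = torch.cuda.memory_allocated()
    with torch.no_grad():
        model(input_ids)
    peak = torch.cuda.max_memory_allocated()
    print(f"steady: {base/1e9:.2f} GB, peak: {peak/1e9:.2f} GB, "
          f"intermediates: {(peak - base)/1e9:.2f} GB")
```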

Why would the intermediates be larger than the input? The hidden dimension and sequence length usually remain the same when passing through transformer layers.
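Not all intermediates keep the input's shape. Using the Llama-2-7b config values (hidden size 4096, MLP intermediate size 11008, 32 heads), the per-layer tensors at 10k tokens compare like this:

```python
# One layer's tensor sizes for Llama-2-7b at 10k tokens (fp32, batch 1).
# Dims from the Llama-2-7b config: hidden 4096, MLP intermediate 11008, 32 heads.
seq, hidden, mlp, heads, fp32 = 10_000, 4096, 11008, 32, 4

input_mb  = seq * hidden * fp32 / 1e6        # layer input: [seq, hidden]
mlp_mb    = seq * mlp * fp32 / 1e6           # MLP intermediate: [seq, 11008]
scores_mb = heads * seq * seq * fp32 / 1e6   # attention scores: [heads, seq, seq]

print(f"input:  {input_mb:,.0f} MB")    # ~164 MB
print(f"mlp:    {mlp_mb:,.0f} MB")      # ~440 MB
print(f"scores: {scores_mb:,.0f} MB")   # ~12,800 MB
```

The MLP expands the hidden dimension ~2.7×, and the attention score matrix grows quadratically in sequence length, so at 10k tokens it dwarfs the input.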

Hey, could you please check your inbox? I sent a message, if you have time to look at it.

Before we dive into debugging, can you share how much memory is already taken up on your NVIDIA card?

Run watch -n 1 nvidia-smi in your terminal and check the memory used just by loading the model. Without the model, ~14 MB of VRAM is usually consumed.

Then run the inference and watch how the memory increases in the terminal. This should give you some idea of where the issue is.

Yeah, I’ve been doing that. Loading the model and optimizer usually takes up around 20 GB. Then during inference I get OOM.

20 GB of VRAM… that means you are left with at least another 28 GB.

I am just trying to find the root cause…

  • Could you please share the full code, so it can be reviewed? (I must have asked for it earlier… my bad.)

  • Check whether a smaller number of tokens, like 1k or 2k, executes successfully. Measure the memory usage with those runs.

  • Also, what batch_size are you using in the train_loader? Is it 1, 4, 8, or 16?

Let’s then try to dive into the code mechanics.
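The token-count experiment above can be automated: keep doubling the prompt length until the forward pass runs out of memory. A sketch assuming a recent PyTorch (which exposes torch.cuda.OutOfMemoryError); max_seq_len and the vocab size default are illustrative, not library APIs:

```python
import torch

def max_seq_len(model, start=1024, limit=16384, vocab=32000):
    """Double a random prompt's length until the forward pass OOMs.

    Returns the longest length that succeeded at batch size 1.
    """
    seq, ok = start, 0
    while seq <= limit:
        ids = torch.randint(0, vocab, (1, seq), device="cuda")
        try:
            with torch.no_grad():
                model(ids)
            ok = seq      # this length fit; try double
            seq *= 2
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release fragments before returning
            break
    return ok
```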

I work with a 12 GB NVIDIA RTX 4070 and face a lot of OOM errors. I used the above tactics to get bigger models and token counts to execute. (Working with fewer resources does teach a lesson or two.)