torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate

Hi there.

I’m trying to run a version of the new Llama 3 model.
Specifically this one: NurtureAI/Meta-Llama-3-8B-Instruct-64k-GGUF on Hugging Face.

When I run the code from the example, or modify the messages a bit, everything works fine.

If I give it a very large message (say, some HTML code), I get this error:

python ./main.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.36s/it]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Traceback (most recent call last):
  File "/home/antouank/_REPOS_/test-llama3-64k/./main.py", line 1439, in <module>
    outputs = model.generate(
              ^^^^^^^^^^^^^^^
  File "/home/antouank/_REPOS_/test-llama3-64k/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/antouank/_REPOS_/test-llama3-64k/venv/lib/python3.11/site-packages/transformers/generation/utils.py", line 1622, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/home/antouank/_REPOS_/test-llama3-64k/venv/lib/python3.11/site-packages/transformers/generation/utils.py", line 2791, in _sample
    outputs = self(
              ^^^^^
  File "/home/antouank/_REPOS_/test-llama3-64k/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/antouank/_REPOS_/test-llama3-64k/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/antouank/_REPOS_/test-llama3-64k/venv/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1208, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/home/antouank/_REPOS_/test-llama3-64k/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/antouank/_REPOS_/test-llama3-64k/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/antouank/_REPOS_/test-llama3-64k/venv/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1018, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/home/antouank/_REPOS_/test-llama3-64k/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/antouank/_REPOS_/test-llama3-64k/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/antouank/_REPOS_/test-llama3-64k/venv/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 756, in forward
    hidden_states = self.mlp(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/antouank/_REPOS_/test-llama3-64k/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/antouank/_REPOS_/test-llama3-64k/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/antouank/_REPOS_/test-llama3-64k/venv/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 240, in forward
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/antouank/_REPOS_/test-llama3-64k/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/antouank/_REPOS_/test-llama3-64k/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/antouank/_REPOS_/test-llama3-64k/venv/lib/python3.11/site-packages/torch/nn/modules/activation.py", line 396, in forward
    return F.silu(input, inplace=self.inplace)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/antouank/_REPOS_/test-llama3-64k/venv/lib/python3.11/site-packages/torch/nn/functional.py", line 2102, in silu
    return torch._C._nn.silu(input)
           ^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.68 GiB. GPU

My GPU is a 4090 and definitely has more than 1.68 GiB of free VRAM! So what’s the issue here?
How can I make it accept large messages?

I tried googling this but couldn’t find any similar reference anywhere.
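
In case it helps, my main.py is essentially the example from the model card. Here is a simplified sketch of what it does (not my exact script; the repo id, dtype and HTML file name below are placeholders):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id: I'm loading the safetensors (non-GGUF) variant of the
# 64k model; bfloat16 + device_map="auto" follow the model card example.
model_id = "NurtureAI/Meta-Llama-3-8B-Instruct-64k"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    # The "very large message": an HTML page pasted in as the user turn.
    {"role": "user", "content": open("page.html").read()},
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))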

Thank you.

The error points to a failed attempt to allocate another 1.68 GiB on top of what is already allocated. Your error message looks cut off; the full message normally also reports the overall memory usage.
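
If you want to see those overall numbers yourself, a quick snapshot right before the generate call will show them (these are all standard torch.cuda calls, nothing model-specific):

import torch

# Snapshot of GPU memory just before model.generate(...); all values in bytes.
free, total = torch.cuda.mem_get_info()    # free/total as reported by the driver
allocated = torch.cuda.memory_allocated()  # memory held by live tensors
reserved = torch.cuda.memory_reserved()    # memory held by PyTorch's caching allocator

gib = 1024 ** 3
print(f"total     : {total / gib:.2f} GiB")
print(f"free      : {free / gib:.2f} GiB")
print(f"allocated : {allocated / gib:.2f} GiB")
print(f"reserved  : {reserved / gib:.2f} GiB")

# Detailed breakdown of the caching allocator, if needed:
print(torch.cuda.memory_summary())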

That is the whole output. There’s nothing “cut”, unless I misunderstood what you said.

So a larger prompt can affect memory needs on a scale of gigabytes?
I didn’t expect that.

I made a video, and it seems like it does fill up the VRAM (see the nvidia-smi output below).

Is there a way to offload some of it to RAM?
And how can I calculate the maximum prompt it can handle with ~22 GB of VRAM?

I’m baffled that it works with a prompt of 500 lines but not with one of 1500 lines. (Basically I’m just giving it an HTML page as input; it’s not that many tokens per line.)

I’m not deeply familiar with the model, but you could check how the sequence length is used inside it and estimate the memory usage with, e.g., this approach.
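
As a rough illustration of that kind of estimate (only a back-of-the-envelope sketch, assuming bf16/fp16 inference without quantization and the stock Llama-3-8B shapes: 32 layers, 8 KV heads of dimension 128, MLP intermediate size 14336), the prompt-dependent memory comes mostly from the KV cache plus the intermediate tensors that are materialized for the whole prompt during the prefill pass:

# Back-of-the-envelope: how memory grows with prompt length for Llama-3-8B
# at 2 bytes per element (bf16/fp16). Shapes are the stock Llama-3-8B config;
# the 64k finetune keeps the same architecture.
BYTES = 2             # bf16 / fp16
N_LAYERS = 32
N_KV_HEADS = 8        # grouped-query attention
HEAD_DIM = 128
INTERMEDIATE = 14336  # MLP gate/up projection width

def kv_cache_bytes(seq_len: int) -> int:
    # K and V, per layer, per token: n_kv_heads * head_dim elements each
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES * seq_len

def mlp_intermediate_bytes(seq_len: int) -> int:
    # One (seq_len, 14336) tensor, e.g. the gate_proj/silu output that the
    # traceback dies on while prefilling the prompt.
    return seq_len * INTERMEDIATE * BYTES

gib = 1024 ** 3
for tokens in (2_000, 16_000, 32_000, 64_000):
    print(f"{tokens:>6} tokens: KV cache ~{kv_cache_bytes(tokens) / gib:.2f} GiB, "
          f"one MLP intermediate tensor ~{mlp_intermediate_bytes(tokens) / gib:.2f} GiB")

Under those assumptions the weights alone take roughly 15 GiB, so a 24 GiB card fills up quickly as the prompt grows; and if the 1500-line HTML page tokenizes to several tens of thousands of tokens, a single such intermediate tensor is already in the ballpark of the 1.68 GiB allocation that fails in the traceback. Offloading can buy some headroom (e.g. device_map="auto" with a max_memory budget via accelerate, or 4-bit quantization to shrink the weights), but the prompt-dependent parts still have to fit somewhere, so it raises the ceiling rather than removing it.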