OutOfMemoryError for T5EncoderModel

I am trying to get embeddings for protein sequences using Rostlab/prot_t5_xl_half_uniref50-enc. Some of the sequences are very long (more than 15,000 characters). Calling the model with sequences like this causes an OutOfMemoryError, like the one below:

OutOfMemoryError: CUDA out of memory. Tried to allocate 42.46 GiB. GPU 0 has a total capacty of 14.75 GiB of which 3.80 GiB is free. Process 13042 has 10.95 GiB memory in use. Of the allocated memory 10.02 GiB is allocated by PyTorch, and 127.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
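
For reference, the kind of call that triggers the error looks roughly like this (a simplified sketch, not my exact code; protein_string stands in for one of my raw sequences, and the residue spacing follows the usual ProtT5 examples):

import torch
from transformers import T5Tokenizer, T5EncoderModel

device = torch.device("cuda:0")
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc")
model = T5EncoderModel.from_pretrained(
    "Rostlab/prot_t5_xl_half_uniref50-enc", torch_dtype=torch.float16
).to(device)
model.eval()

# ProtT5 expects amino acids separated by spaces, e.g. "M K T A Y ..."
sequence = " ".join(protein_string)  # protein_string can be ~15,000 residues long
inputs = tokenizer(sequence, return_tensors="pt").to(device)

with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state  # (1, seq_len, hidden) -> OOM here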

Could you suggest a way (or ways) to utilise multiple GPUs (16 GiB each) for model inference? Is distributed inference possible in my case?

One way you could do it is to just manually split the input across separate GPUs, then concatenate the outputs back in your CPU RAM, or wherever you have enough space to fit them.

import copy
import torch

def split_embedding_load(sequence, embedding, device1, device2, split: int):
    # Run the first half of the sequence on one GPU...
    embedding1 = embedding.to(device1)
    seq1 = embedding1(sequence[:, :split].to(device1))
    # ...and a copy of the module on the second GPU for the other half.
    embedding2 = copy.deepcopy(embedding).to(device2)
    seq2 = embedding2(sequence[:, split:].to(device2))
    # Concatenate the two halves back on the CPU along the sequence dimension.
    return torch.cat([seq1.to('cpu'), seq2.to('cpu')], dim=1)

The above is not an elegant solution, and it might be slow, but it will get the job done. Are you sending a batch size of 1? If not, that might be the first thing to try.

Alternative methods to reduce the memory overhead would be to convert the model to 8-bit or 16-bit precision and try it on one GPU. For example:

import torch
import torch.nn as nn

embedding = nn.Embedding(10000, 512)
embedding = embedding.to(dtype=torch.bfloat16)
dummy_input = torch.randint(0, 10000, (12, 100))

print(embedding(dummy_input).dtype)  # torch.bfloat16

For 8-bit, you can find more information here: Quantization — PyTorch 2.1 documentation
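
Applied to the model in question, a rough sketch of both options (model name taken from your post; the 8-bit path assumes bitsandbytes and accelerate are installed, and newer transformers versions prefer passing a BitsAndBytesConfig instead of load_in_8bit):

import torch
from transformers import T5EncoderModel

# Half precision (fp16) on a single GPU:
model_fp16 = T5EncoderModel.from_pretrained(
    "Rostlab/prot_t5_xl_half_uniref50-enc", torch_dtype=torch.float16
).to("cuda:0")

# 8-bit weights via bitsandbytes, placed automatically across available GPUs:
model_int8 = T5EncoderModel.from_pretrained(
    "Rostlab/prot_t5_xl_half_uniref50-enc", load_in_8bit=True, device_map="auto"
)

One caveat: with a ~15,000-token input, a large share of that 42 GiB is likely attention activations rather than weights, so reducing weight precision alone may not be enough by itself.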

Thank you very much.

I have concerns regarding the first option, as splitting the protein sequence could produce different embeddings compared to processing the whole sequence.

If it’s just an embedding layer, it won’t make a difference. But if it’s an embedding plus an encoder, then the attention needs to be computed over the entire sequence.

Do you know what context window they trained the model with? Typical context windows are 2,048 tokens.

I have checked the maximum sequence length the model can be used with, and it is 512 characters. However, I also found out in this thread, T5 Model : What is maximum sequence length that can be used with pretrained T5 (3b model) checkpoint? · Issue #5204 · huggingface/transformers · GitHub, that T5 models can handle sequences much longer than the ones they were trained on.
I tried the embedder on the same protein sequence at different lengths (truncated to 512, 1024 and 2048 characters), and the model returned completely different embeddings for each of them. Thus, concatenation won’t work, unfortunately.
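
Roughly this kind of check, simplified (tokenizer and model as loaded earlier, protein is one raw sequence string; the exact pooling I used may differ):

import torch

def first_512_embedding(protein, length):
    # Truncate the raw sequence, embed it, then average the first 512 positions
    # so the same residues are compared regardless of truncation length.
    seq = " ".join(protein[:length])
    inputs = tokenizer(seq, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, tokens, hidden)
    return hidden[0, :512].mean(dim=0)

e512, e1024, e2048 = (first_512_embedding(protein, n) for n in (512, 1024, 2048))
print(torch.nn.functional.cosine_similarity(e512, e1024, dim=0))
print(torch.nn.functional.cosine_similarity(e512, e2048, dim=0))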

What they likely mean is that it won’t throw an error for larger sizes, not that it will actually perform well past those sizes. In fact, Google also released a LongT5 series of models designed to handle a context window of up to 16k tokens:

The LongT5 paper results suggest regular T5 caps out at around 3k tokens for performance.
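
If LongT5 is worth trying, a minimal sketch of loading its encoder (class and checkpoint names as in the transformers docs; note it is a general text model, not a protein model, so it is not a drop-in replacement for ProtT5):

from transformers import AutoTokenizer, LongT5EncoderModel

# LongT5 uses local / transient-global attention, so memory grows roughly
# linearly with sequence length rather than quadratically.
tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")
model = LongT5EncoderModel.from_pretrained("google/long-t5-tglobal-base")

inputs = tokenizer("a very long input " * 2000, return_tensors="pt")
hidden = model(**inputs).last_hidden_state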