I am trying to get embeddings for protein sequences using Rostlab/prot_t5_xl_half_uniref50-enc. Some of the sequences are very long (more than 15,000 characters). Calling the model with sequences of this length causes an OutOfMemoryError like the one below:
OutOfMemoryError: CUDA out of memory. Tried to allocate 42.46 GiB. GPU 0 has a total capacty of 14.75 GiB of which 3.80 GiB is free. Process 13042 has 10.95 GiB memory in use. Of the allocated memory 10.02 GiB is allocated by PyTorch, and 127.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Could you suggest a way (or ways) to utilise multiple GPUs (16 GiB each) for model inference? Is distributed inference possible in my case?
One way you could do it is to manually split the work across separate GPUs, then concatenate the resulting embeddings in CPU RAM, or wherever you have enough space to fit them.
The above is not an elegant solution, and it might be slow, but it will get the job done. Are you sending a batch size of 1? If not, that might be a good first step.
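A minimal sketch of the manual-split idea: round-robin the sequences across the available GPUs with batch size 1 and gather the results in CPU RAM. `embed_on_device` is a hypothetical stand-in for a real forward pass through the ProtT5 encoder, so the plumbing runs without downloading the model.

```python
# Hedged sketch: distribute sequences across GPUs, collect embeddings on CPU.
import torch

def embed_on_device(sequence: str, device: str) -> torch.Tensor:
    # Stand-in: one 4-dim vector per residue. A real implementation would
    # tokenize `sequence` and run the encoder on `device`.
    return torch.randn(len(sequence), 4, device=device)

def embed_all(sequences, devices):
    embeddings = []
    for i, seq in enumerate(sequences):
        device = devices[i % len(devices)]   # round-robin GPU assignment
        emb = embed_on_device(seq, device)
        embeddings.append(emb.to("cpu"))     # keep results in CPU RAM
    return embeddings

devices = ["cuda:0", "cuda:1"] if torch.cuda.device_count() >= 2 else ["cpu"]
embs = embed_all(["MKTAYIAK", "GAVLIPF"], devices)
print([tuple(e.shape) for e in embs])  # → [(8, 4), (7, 4)]
```

The per-sequence loop is effectively batch size 1, which is also the cheapest memory setting on a single GPU.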
Alternative methods to reduce the memory overhead would be to convert the model to 8-bit or 16-bit precision and try it on one GPU. For example:
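A hedged illustration of the 16-bit route, shown on a tiny stand-in layer so it runs without the ProtT5 download. For the real model you would pass `torch_dtype=torch.float16` to `from_pretrained` (or use bitsandbytes 8-bit quantization); the checkpoint name in the comment is the one from the question.

```python
import torch

# For the actual model (not run here; the checkpoint is several GB):
# from transformers import T5EncoderModel
# model = T5EncoderModel.from_pretrained(
#     "Rostlab/prot_t5_xl_half_uniref50-enc", torch_dtype=torch.float16
# ).to("cuda:0")

layer = torch.nn.Linear(1024, 1024)  # stand-in for one model layer
fp32_bytes = sum(p.numel() * p.element_size() for p in layer.parameters())
layer = layer.half()                 # convert weights to float16
fp16_bytes = sum(p.numel() * p.element_size() for p in layer.parameters())
print(fp32_bytes // fp16_bytes)      # → 2: float16 halves weight memory
```

Note that halving the weights does not shrink the attention matrices as much as shorter inputs do, so for 15k-residue sequences precision alone may not be enough.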
I have concerns about the first option, as splitting the protein sequence could produce different embeddings compared to processing the whole sequence.
If it’s just an embedding layer, it won’t make a difference. But if it’s an embedding plus an encoder, then the attention needs to be computed over the entire sequence.
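The distinction above can be demonstrated with tiny random modules: a plain embedding lookup is context-free, but an encoder's output for a token changes with the rest of the sequence, which is why chunking alters encoder embeddings.

```python
import torch

torch.manual_seed(0)
embed = torch.nn.Embedding(20, 16)
enc_layer = torch.nn.TransformerEncoderLayer(d_model=16, nhead=4, batch_first=True)
encoder = torch.nn.TransformerEncoder(enc_layer, num_layers=1).eval()

a = torch.tensor([[1, 2, 3, 4]])  # token 1 followed by one context
b = torch.tensor([[1, 7, 8, 9]])  # same token 1, different context

# Embedding lookup: the vector for token 1 is identical in both sequences.
same = torch.equal(embed(a)[0, 0], embed(b)[0, 0])

with torch.no_grad():
    ea = encoder(embed(a))[0, 0]  # encoder output for token 1 in context a
    eb = encoder(embed(b))[0, 0]  # ... and in context b
diff = not torch.allclose(ea, eb)

print(same, diff)  # → True True
```

So for ProtT5, which runs a full encoder, per-chunk embeddings will generally differ from whole-sequence ones.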
Do you know what context window they trained their model with? Typical context windows are 2,048 tokens.
What they likely mean is that it won’t raise a model error for larger sizes, not that it will actually perform well past those sizes. In fact, Google also released a LongT5 series of models designed to handle context windows of up to 16k tokens.
The LongT5 paper results suggest regular T5 caps out at around 3k tokens for performance.