Understand Memory Usage of Pytorch Tensors for Inference

I plan on translating large text corpora from various languages to English with Large Language Models. As a first step, I experimented a bit to probe the computational limits of my machine.

Specifically, I try to translate a text corpus using Meta's No Language Left Behind (NLLB) translation model with the following code:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
device = "cpu"
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M").to(device)

def translationPipeline(text):
    input_ids = tokenizer.encode(text, return_tensors="pt").to(device)
    outputs = model.generate(input_ids, forced_bos_token_id=tokenizer.lang_code_to_id["deu_Latn"])
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return decoded

# c = df_text.iloc[1,:].body.apply(translationPipeline)
with torch.no_grad():
    c = translationPipeline(df_text.iloc[0,:].body)

Here, df_text.iloc[0,:].body is a single text corpus of roughly 150,000 characters. The tokenizer transforms it into a tensor of shape (1, 42809). When I execute the code above, I get an out-of-memory error on my 256 GB RAM machine.

I am aware that the token length of my corpus exceeds the maximum input length for this model. However, I am still wondering where this high RAM usage comes from: the raw text itself is only about 150 kB, which is nowhere near 256 GB.
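For intuition, here is a rough back-of-envelope estimate. The head count and float32 dtype below are assumptions based on the published NLLB-200-distilled-600M configuration, not values I verified in this run, but they illustrate how full self-attention blows up memory: each layer materialises a score tensor of shape (heads, seq_len, seq_len), which alone already exceeds 100 GB at this sequence length.

```python
# Back-of-envelope estimate of the self-attention score matrices alone.
# Assumptions (from the published NLLB-600M config, not verified here):
# 16 attention heads per layer, float32 activations.
seq_len = 42809          # token length reported by the tokenizer
num_heads = 16           # assumed number of attention heads
bytes_per_float = 4      # float32

# One layer materialises a (num_heads, seq_len, seq_len) score tensor.
scores_per_layer = num_heads * seq_len**2 * bytes_per_float
print(f"attention scores, one layer: {scores_per_layer / 1e9:.0f} GB")
# ~117 GB for a single layer, before counting the other layers,
# the key/query/value projections, or the decoder.
```

Because the score tensor scales quadratically in the sequence length, halving the input would cut this term by roughly a factor of four, which is presumably why the usage explodes far beyond the size of the raw text.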

Where does the additional RAM requirement come from? And how much RAM would be needed to translate the whole corpus at once?