I am loading a model onto my GPU and trying to check how it performs when generating text. More specifically, I have the following code:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "microsoft/xtremedistil-l12-h384-uncased"  # Find popular HuggingFace models here: https://huggingface.co/models

# Load the tokenizer and the model onto the GPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    output_attentions=True,
)
# Create a text-generation pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False,
)
input_text = "What is the co-capital of Turkey according to citizens opinions: "

# Tokenize the input prompt and move it to the GPU
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
input_ids = input_ids.to("cuda")

# Generate text (the pipeline tokenizes the prompt itself, so it is called on the raw string)
output = generator(input_text)
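For context, I can inspect how much GPU memory is free and how much the PyTorch allocator is holding with torch.cuda.mem_get_info and the allocator statistics (just a diagnostic sketch, separate from the script above):

import torch

# Free and total device memory in GiB, as reported by the CUDA driver
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"free: {free_bytes / 1024**3:.2f} GiB of {total_bytes / 1024**3:.2f} GiB")

# Memory currently allocated vs. reserved (cached) by PyTorch's allocator
print(f"allocated by PyTorch: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
print(f"reserved by PyTorch:  {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")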
However, when I try to run this code, I receive the following error:
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.00 MiB. GPU 0 has a total capacity of 23.57 GiB of which 5.88 MiB is free. Including non-PyTorch memory, this process has 23.55 GiB memory in use. Of the allocated memory 20.55 GiB is allocated by PyTorch, and 2.71 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
What is this message telling me? How can I overcome this issue?
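In case it is relevant, my understanding of the suggestion at the end of the error message is that PYTORCH_CUDA_ALLOC_CONF has to be set before the first CUDA allocation, either in the shell (export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True) or at the top of the script. This is only a sketch based on the error text; I have not verified that it fixes the problem:

import os

# Must be set before PyTorch makes its first CUDA allocation,
# per the hint in the error message above
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after setting the variable, to be safe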