Fine-tuning a model, I get: CUDA out of memory

Hi, I’m trying to fine-tune a model (I’ve tried different ones), in this case Dolphin Mistral, on Windows 11 with an NVIDIA 4090. My settings are:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

My torch version is 2.1.2, and torch.cuda.is_available() returns True.

I cleaned the cache before training. My GPU has plenty of free space, and nothing else is running.

But when running the python script for finetuning I get:

Error during training: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacty of 15.99 GiB of which 0 bytes is free. Of the allocated memory 30.12 GiB is allocated by PyTorch, and 17.04 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Even though I’ve set this under “System Variables” in my “Environment Variables”:

PYTORCH_CUDA_ALLOC_CONF max_split_size_mb:32
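One detail worth checking: PYTORCH_CUDA_ALLOC_CONF is only read when torch initializes its CUDA caching allocator, so it has to be in the environment before torch is imported. A minimal sketch of setting it from Python instead of the Windows dialog:

```python
import os

# Must be set before "import torch" runs, or the allocator won't see it.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"

import torch  # the CUDA allocator now picks up the setting on first use
```

If the variable is set after torch has already been imported (e.g. inside a running notebook), it has no effect.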

What is wrong? Why isn’t it working? It should work; I have enough GPU power.

Appreciate any help.

These models are generally huge and often won’t fit on a single GPU. Either decrease your batch size or freeze the initial layers of the model.
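Freezing comes down to turning off gradients on the early parameters, so no gradient or optimizer memory is spent on them. A sketch on a toy stand-in model (the layer structure of the real model will differ):

```python
import torch.nn as nn

# Toy stand-in for a large model: three stacked layers.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 2))

# Freeze the first two layers: no gradients, no optimizer state for them.
for layer in list(model)[:2]:
    for p in layer.parameters():
        p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(trainable, total)  # only the last layer's parameters remain trainable
```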

Thanks for answering @FaKa. I’ve reduced the batch_size to 1 and increased gradient_accumulation_steps as far as the fine-tuning would accept, but no success. I’m not sure, but if the model is offered for fine-tuning, why wouldn’t it be possible on my computer with the NVIDIA 4090? Is it still not enough?

Appreciate your help.

There isn’t a short answer to your question. I’d recommend reading up on the different consumers of VRAM when fine-tuning a model: the model weights, the gradients, the optimizer state, and the activation memory.
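A back-of-the-envelope estimate for a ~7B-parameter model with mixed-precision AdamW (fp16 weights and gradients, fp32 optimizer copies) shows why 16 GB is nowhere near enough for full fine-tuning, even before counting activations; the 16-bytes-per-parameter total is the usual rule of thumb:

```python
params = 7e9  # ~7B parameters (Dolphin Mistral class)

weights_gb = params * 2 / 1e9  # fp16 weights: 2 bytes/param
grads_gb = params * 2 / 1e9    # fp16 gradients: 2 bytes/param
# AdamW in mixed precision: fp32 master weights + two fp32 moment buffers
optimizer_gb = params * (4 + 4 + 4) / 1e9

total_gb = weights_gb + grads_gb + optimizer_gb
print(total_gb)  # ~112 GB before activations, far beyond a 16 GB card
```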

Here’s a guide I wrote, and there are more online that might be helpful.

Thanks a lot for your response. I’ve looked at the doc you shared on your Twitter account.

Of course there has to be something wrong in my code, but even with the batch size as low as 1 and gradient accumulation at 16, it surprises me that my NVIDIA 4090 can’t handle the fine-tuning. It’s strange, because there are GitHub repos like axolotl, and VS Code extensions like Windows AI Studio, that fine-tune the same model I’m trying to use. So I’m not convinced it’s impossible, but at the same time I don’t know what else to do to solve the “CUDA out of memory”.

Here’s the code:

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
import json
from torch.utils.data import Dataset
import torch

print("Script started")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device set:", device)

# The exact checkpoint id was cut off in the paste; "..." stands for the
# Dolphin Mistral model id used in the original script.
MODEL_NAME = "..."

try:
    print("Loading model...")
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    print("Model and tokenizer loaded successfully.")
except Exception as e:
    print(f"Error loading model or tokenizer: {e}")

# Reduce the tokenization length even further
tokenizer.model_max_length = 64
model.config.max_length = 64

try:
    print("Loading data...")
    # Load only a very small subset of the data
    with open('processed_lawyer_data.json', 'r') as file:
        texts = json.load(file)[:2]  # Only load 2 examples for testing
    print("Data loaded successfully.")
except Exception as e:
    print(f"Error loading data: {e}")

try:
    print("Tokenizing texts...")
    inputs = tokenizer(texts, return_tensors='pt', padding=True,
                       truncation=True, max_length=tokenizer.model_max_length)
    print("Texts tokenized successfully.")
except Exception as e:
    print(f"Error during tokenization: {e}")

class LawyerDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings['input_ids'])

try:
    dataset = LawyerDataset(inputs)
    print("Dataset created successfully.")
except Exception as e:
    print(f"Error creating dataset: {e}")

try:
    print("Defining training arguments...")
    training_args = TrainingArguments(
        output_dir="./results",  # remaining arguments were cut off in the paste
        per_device_train_batch_size=1,  # Keep batch size as 1
        gradient_accumulation_steps=16,  # Increase gradient accumulation steps
        warmup_steps=10,  # Reduced warmup steps
    )
    print("Training arguments defined.")
except Exception as e:
    print(f"Error setting training arguments: {e}")

try:
    print("Initializing trainer...")
    trainer = Trainer(  # remaining arguments were cut off in the paste
        model=model,
        args=training_args,
        train_dataset=dataset,
    )
    print("Trainer initialized.")
except Exception as e:
    print(f"Error initializing trainer: {e}")

try:
    print("Starting training...")
    trainer.train()
    print("Training completed successfully.")
except RuntimeError as e:
    print("Cleaning up CUDA memory...")
    torch.cuda.empty_cache()
    print(f"Error during training: {e}")
except Exception as e:
    print(f"Error during training: {e}")

print("Script ended")

Appreciate a lot your help.

Try these, in order:

  1. Load a quantized variant, e.g. TheBloke/dolphin-2.2.1-mistral-7B-GPTQ · Hugging Face
  2. Use small sequence lengths and batch sizes (it looks like you’re doing this already, but be more aggressive; go all the way down to 1 just to see if you can get rid of the OOM)
  3. Check which optimizer you’re using for the fine-tuning; if it’s Adam, try SGD instead

If you’re doing fine-tuning, you can keep everything in bfloat16.

That should cut the footprint in half without losing any fidelity.
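The halving is visible directly in element sizes: bfloat16 stores 2 bytes per value instead of fp32’s 4. A sketch on a toy layer (for a Hugging Face model, the equivalent is passing torch_dtype=torch.bfloat16 to from_pretrained):

```python
import torch
import torch.nn as nn

layer = nn.Linear(1024, 1024)             # fp32 by default
fp32_bytes = layer.weight.element_size()  # bytes per value in fp32

layer = layer.to(torch.bfloat16)          # cast parameters to bfloat16
bf16_bytes = layer.weight.element_size()  # bytes per value in bfloat16

print(fp32_bytes, bf16_bytes)  # 4 2
```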

Alternatively, you can follow the Hugging Face recommendations for fine-tuning weights in quantized (4-bit) mode:

Note that the above is a bit more involved than simply casting to torch.bfloat16 before training LoRA weights on top.
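A sketch of that 4-bit recipe, assuming the bitsandbytes and peft packages are installed; the model id and LoRA hyperparameters here are illustrative, not tuned:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "cognitivecomputations/dolphin-2.2.1-mistral-7b",  # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Train only small LoRA adapters on top of the frozen 4-bit base model.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```

The resulting model can then be passed to Trainer as before; only the adapter weights carry gradients and optimizer state.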
