I need to compare full fine-tuning and LoRA on a T5 model on a summarization task. The problem is that on my GPU full fine-tuning takes ~4.5 GB of VRAM and an hour to train, and LoRA takes the same amount of memory and time with the same hyperparameters. Sometimes LoRA even takes 100-200 MB more according to PyTorch. Am I missing something? Do I need to do this a different way?
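By "according to PyTorch" I mean the peak allocator stats, read roughly like this (a minimal sketch; my actual logging may differ slightly):

import torch

torch.cuda.reset_peak_memory_stats(device)
trainer.train()
print(f"peak allocated: {torch.cuda.max_memory_allocated(device) / 1024**3:.2f} GB")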
Here are the relevant snippets of my code. I used to have my own training loop but then switched to the Trainer class, though it made no difference.
Full fine-tune setup:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained('t5-base')
model = AutoModelForSeq2SeqLM.from_pretrained('t5-base').to(device)

train_args = TrainingArguments(
    output_dir='finetune-out',  # TrainingArguments needs an output dir; path is arbitrary
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    learning_rate=lr,
    num_train_epochs=epochs,
    warmup_steps=warmup,
    optim='adafactor',
)
trainer = Trainer(model=model, args=train_args, train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()
LoRA setup:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q", "v"],  # T5's attention query/value projections
    lora_dropout=0.1,
    bias="none",
)

tokenizer = AutoTokenizer.from_pretrained('t5-base')
model = AutoModelForSeq2SeqLM.from_pretrained('t5-base').to(device)
model = get_peft_model(model, config)

train_args = TrainingArguments(
    output_dir='lora-out',  # TrainingArguments needs an output dir; path is arbitrary
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    learning_rate=lr,
    num_train_epochs=epochs,
    warmup_steps=warmup,
    optim='adafactor',
)
trainer = Trainer(model=model, args=train_args, train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()
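To double-check that the adapter is actually applied, I can also print the trainable-parameter count that PEFT reports; with r=8 on q/v this should be well under 1% of t5-base's weights:

model.print_trainable_parameters()
# e.g. "trainable params: ... || all params: ... || trainable%: <1"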
Out of curiosity I also tried freezing all weights except those in lm_head (~11% of the parameters) by setting requires_grad = False, and saw no difference from the full fine-tune.
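Concretely, the freezing was a loop like this (sketch of what I ran; the exact name matching may have differed):

for name, param in model.named_parameters():
    # train only the output projection, freeze everything else
    param.requires_grad = name.startswith('lm_head')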