Hello,
I am currently studying how to successfully fine-tune some LLMs after pruning them in a certain way, so that their performance on a set of tasks is restored as much as possible. In particular, I am pruning the following models:
- Llama-2-7b-hf
- Llama-2-13b-hf
- DeepSeek-R1-Distill-Llama-8B
- DeepSeek-R1-Distill-Qwen-14B
- DeepSeek-R1-Distill-Qwen-32B
I am evaluating them on the tasks arc_challenge, arc_easy, hellaswag, lambada_openai, openbookqa, piqa and winogrande, using lm-eval (GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models.), which is very convenient. I prune the MLP layers: I prune mlp.gate_proj, mlp.up_proj or mlp.down_proj (or a combination of them) by setting square blocks (512x512 or 256x256) of the weight matrix to zero; a simplified sketch of this block-zeroing step is included right after the training script below. I need the models to regain the lost performance, but I am having a hard time achieving that. Here is the training script that I am currently using:
from peft import LoraConfig, get_peft_model
from torch.optim import AdamW
import transformers

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.1,
    # On Llama/Qwen architectures the attention output projection is named "o_proj"
    # (there is no module called "output_proj").
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],
)
model = get_peft_model(model, lora_config)
# This optimizer is passed to the Trainer below, so its lr (2e-4) is the one actually used.
optimizer = AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)
training_args = transformers.TrainingArguments(
    output_dir="./results",
    logging_strategy="epoch",
    eval_strategy="steps",
    eval_steps=500,
    learning_rate=1e-4,  # effectively ignored, since a custom optimizer is passed in `optimizers`
    auto_find_batch_size=True,
    gradient_accumulation_steps=1,
    num_train_epochs=2,
    eval_on_start=True,
    bf16=True,
    log_on_each_node=False,
)
trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    processing_class=tokenizer,
    data_collator=data_collator,
    optimizers=(optimizer, None),  # None -> Trainer creates the default LR scheduler
)
trainer.can_return_loss = True
trainer.train()

# Merge the LoRA adapters back into the base weights
model = model.merge_and_unload()
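For reference, this is roughly how the block-zeroing step works. It is a simplified sketch, not my exact code: the `frac` argument (fraction of blocks to remove) and the per-layer loop in the comment are just illustrative, and here I show the lowest-L2-norm variant (the random variant simply shuffles the blocks instead of sorting them).

import torch

@torch.no_grad()
def prune_blocks(weight: torch.Tensor, block_size: int = 512, frac: float = 0.3):
    """Zero out the square blocks of `weight` with the lowest L2 (Frobenius) norm.

    Only full blocks are considered; a trailing partial block (if the dimension
    is not a multiple of block_size) is left untouched.
    """
    rows, cols = weight.shape
    blocks = []
    for i in range(0, rows - block_size + 1, block_size):
        for j in range(0, cols - block_size + 1, block_size):
            block = weight[i:i + block_size, j:j + block_size]
            blocks.append((torch.linalg.norm(block.float()).item(), i, j))
    blocks.sort(key=lambda b: b[0])  # lowest-norm blocks first
    for _, i, j in blocks[:int(len(blocks) * frac)]:
        weight[i:i + block_size, j:j + block_size] = 0.0

# e.g. applied to down_proj in every decoder layer:
# for layer in model.model.layers:
#     prune_blocks(layer.mlp.down_proj.weight, block_size=512, frac=0.3)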
I have tried training on 80,000 samples of the C4 dataset (sometimes 160,000, but the results barely improve). I use 4 H100 GPUs with 64 GB each. The way I preprocess my data is as follows:
from datasets import load_dataset, DatasetDict

raw_datasets = load_dataset(
    datasets_path,
    split=[f'train[0:{train_number_samples}]', f'validation[:{validation_number_samples}]'],
    cache_dir="./cache_training",
)
raw_datasets = DatasetDict({
    "train": raw_datasets[0],
    "validation": raw_datasets[1],
})
column_names = list(raw_datasets["train"].features)
text_column_name = "text" if "text" in column_names else column_names[0]

def tokenize_function(examples):
    return tokenizer(examples[text_column_name])

tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=column_names,
)
# Main data processing function that concatenates all texts from the dataset
# and generates chunks of block_size_texts tokens.
def group_texts(examples):
    block_size_texts = 1024
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could add padding instead of this drop if the
    # model supported it. You can customize this part to your needs.
    total_length = (total_length // block_size_texts) * block_size_texts
    # Split into chunks of block_size_texts.
    result = {
        k: [t[i : i + block_size_texts] for i in range(0, total_length, block_size_texts)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
)
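For completeness, the tokenizer and data_collator passed to the Trainer are essentially the standard causal-LM setup, roughly like this (minimal sketch; `model_name` stands for one of the checkpoints listed above, and since group_texts already produces fixed-length blocks with labels, a plain causal-LM collator is enough):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained(model_name)  # model_name: one of the checkpoints above
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

# mlm=False -> causal LM: the collator stacks the blocks and builds labels from input_ids
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)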
Does anyone know if everything looks fine here? The validation loss decreases considerably, so the training process is clearly doing something. I have also tried removing the blocks with the lowest L2 norm, hoping this would harm the model less, but I see no difference compared to removing blocks at random. So I thought I could try (1) training without LoRA, (2) keeping the initial and final layers intact when pruning, or (3) expanding the target_modules list (sketched below), but I am not sure whether any of these would be effective.
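To be concrete about (3), I mean extending the LoRA targets to also cover the MLP projections, since those are the matrices where the pruned blocks actually live. Using the same LoraConfig import as above, for example:

# Idea (3): same LoRA hyperparameters, but also adapting the MLP projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj',
                    'gate_proj', 'up_proj', 'down_proj'],
)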
Can anyone give me some better ideas?
Thank you!