Transfer learning (or fine-tuning) a pre-trained model on non-text data

corvusMidnight · December 12, 2022, 5:04pm

I am currently fine-tuning a BERT-based sentiment analysis model using the Hugging Face Trainer (PyTorch). So far, so good.

I have easily managed to fine-tune the model on my text data. However, I’d like to conduct an ablation study to see how the inclusion of linguistic features impacts the model’s performance.

In other words, I want to see how features such as comment length and type-token ratio (each stored in a separate column of my dataset) affect the model’s performance.
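To make these features concrete, here is a minimal sketch of how comment length and type-token ratio could be computed. The helper name `linguistic_features` and the lowercase whitespace tokenization are assumptions for illustration only; they are not necessarily how the column in my dataset was produced.

```python
def linguistic_features(text: str) -> dict:
    """Return simple surface features for one comment (illustrative sketch)."""
    # Assumption: lowercase whitespace tokenization; a real pipeline
    # may use a proper tokenizer and handle punctuation differently.
    tokens = text.lower().split()
    n_tokens = len(tokens)       # comment length in tokens
    n_types = len(set(tokens))   # number of distinct tokens
    # Type-token ratio = distinct tokens / total tokens (0.0 for empty input)
    ttr = n_types / n_tokens if n_tokens else 0.0
    return {"comment_length": n_tokens, "type_token_ratio": ttr}


features = linguistic_features("the cat sat on the mat")
print(features)
```

In this example the sentence has 6 tokens and 5 distinct types, so the ratio is 5/6.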

My data looks roughly like this:

Text	Type-token ratio
Hi, my name is…	1.0
I cannot believe I did not…	0.95

In the specific case above, for instance, I would like to fine-tune the model on the text column but also on the Type-token ratio one.

I know that some people concatenate the two columns into a string, but I am not sure that is the correct method. Is there a more methodologically sound way of doing this?

I was not able to find much information about it.
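For reference, the concatenation approach mentioned above would look roughly like this: the numeric value is rendered as text and appended to the comment before tokenization. The function name `append_feature_as_text`, the `[SEP]` separator string, and the two-decimal rounding are all assumptions made for illustration; the separator in particular should match whatever special token the tokenizer in use actually defines.

```python
def append_feature_as_text(text: str, ttr: float, sep: str = "[SEP]") -> str:
    """Join a comment and its type-token ratio into one input string (sketch)."""
    # Assumption: the feature is written out as plain text with two decimals,
    # so the model only ever sees it as ordinary subword tokens.
    return f"{text} {sep} type-token ratio: {ttr:.2f}"


combined = append_feature_as_text("I cannot believe I did not", 0.95)
print(combined)
```

With this approach the model sees the number only as subword tokens rather than as a numeric input, which is the reason I doubt it is the right method.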

This is my current code, which uses only the text column:

import os

import numpy as np
import pandas as pd
from datasets import Dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

os.environ["WANDB_DISABLED"] = "true"   # turn off Weights & Biases logging

# 75/25 train/validation split on the text and label columns only
dataset = pd.read_csv('/content/gdrive/MyDrive/.../data_train.csv')
train_roberta = dataset[['text_lower', 'label']].sample(frac=0.75)
validation_roberta = dataset[['text_lower', 'label']].drop(train_roberta.index)
train_roberta = train_roberta.dropna()
validation_roberta = validation_roberta.dropna()
train = Dataset.from_pandas(train_roberta, preserve_index=False)
validation = Dataset.from_pandas(validation_roberta, preserve_index=False)

tokenizer = AutoTokenizer.from_pretrained("a_model/a_bert_like_model")

def tokenize_function(example):
    # Pad every example to the model's maximum length and truncate longer ones
    return tokenizer(example["text_lower"], padding="max_length", truncation=True)

tokenized_train_dataset = train.map(tokenize_function, batched=True)
tokenized_test_dataset = validation.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Default arguments for fine-tuning; newer transformers releases rename
# evaluation_strategy to eval_strategy
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained("a_model/a_bert_like_model", num_labels=3)


def compute_metrics(eval_preds):   # compute accuracy and f1-score
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # The GLUE/MRPC metric computes a binary F1, which does not apply to
    # three labels; macro-averaged F1 is used here instead (an assumption)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions, average="macro"),
    }

trainer = Trainer(   # specifying trainer class
    model,
    training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()  # starts fine-tuning