BERT fine-tuning for binary classification with special tokens evaluates badly

I’m fine-tuning a BERT model for binary classification. Before training starts I add some tokens that help with explainability on the task, but I’m getting really bad scores during evaluation. I’m relatively new to fine-tuning BERT models and I’m thinking I’ve messed something up in my training or evaluation function, or the tokens have not been added properly. self.model is a BertForSequenceClassification with no additional layers added to it. The tokenizer is not trained; both model and tokenizer are base BERT.

This is the code I’m using for adding tokens and training the model:

    def add_tokens(self):
        try:
            logger.print_and_log("Adding Extra Tokens ...", "green")
            self.tokenizer.add_tokens(self.extra_tokens, special_tokens=True)
            self.model.resize_token_embeddings(len(self.tokenizer))
        except Exception as e:
            logger.print_and_log("Error: " + str(e), "red")

    def train(self, epochs=3):
        try:
            logger.log("Training Model ...")
            if self.extra_tokens:
                self.add_tokens()
            encodings = self.tokenizer(self.train_data, truncation=True, padding=True, return_tensors="pt")
            dataset = TensorDataset(encodings['input_ids'], encodings['attention_mask'], torch.tensor(self.train_labels))
            dataloader = DataLoader(dataset, sampler=RandomSampler(dataset), batch_size=8)
            optimizer = Adam(self.model.parameters(), lr=2e-5)
            loss_fn = torch.nn.CrossEntropyLoss()

            self.model.train()
            for epoch in range(epochs):
                logger.print_and_log("Epoch: " + str(epoch), "green")
                for batch in tqdm.tqdm(dataloader):
                    optimizer.zero_grad()
                    input_ids, attention_mask, labels = batch
                    input_ids = input_ids.to(self.device)
                    attention_mask = attention_mask.to(self.device)
                    labels = labels.to(self.device)
                    outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
                    loss = loss_fn(outputs.logits, labels)
                    outputs = torch.argmax(outputs.logits, dim=1).cpu().detach().numpy()
                    loss.backward()
                    torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
                    optimizer.step()
            logger.log("Training Complete for " + self.model_name + ".")
        except Exception as e:
            logger.print_and_log("Error: " + str(e), "red")

Hello,

I’ve been working with BERT and RoBERTa recently, so I can help you with this one.

Looking at the code of BertForSequenceClassification from Hugging Face, your code should look like this:

classifier_output = self.model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
optimizer.zero_grad()
classifier_output.loss.backward()
optimizer.step()

BertForSequenceClassification already computes the loss internally using CrossEntropyLoss, so you can just use the loss attribute of the returned SequenceClassifierOutput to backpropagate and then step the optimizer.
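Folded into your train() loop above, the whole inner step would look roughly like this (a sketch reusing your variable names; the separate loss_fn is no longer needed):

for batch in tqdm.tqdm(dataloader):
    input_ids, attention_mask, labels = (t.to(self.device) for t in batch)
    outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    optimizer.zero_grad()
    outputs.loss.backward()  # loss computed internally by the classification head
    torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
    optimizer.step()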

Let me know if this works.

Also, make sure you use the tokens [CLS], [SEP] and [PAD] correctly.
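If you build your inputs by calling the tokenizer directly, as you do in train(), it should insert those for you; a quick way to verify (assuming the bert-base-uncased tokenizer):

enc = self.tokenizer(["hi there", "hello"], truncation=True, padding=True)
print(self.tokenizer.convert_ids_to_tokens(enc["input_ids"][0]))
# ['[CLS]', 'hi', 'there', '[SEP]']
print(self.tokenizer.convert_ids_to_tokens(enc["input_ids"][1]))
# ['[CLS]', 'hello', '[SEP]', '[PAD]']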

I corrected the loss function and it somewhat helped the model’s metrics, but the issue persists. Could the issue lie in using special domain-specific tokens derived from transcripts in the CHAT protocol? For example, in the preprocessing stage I convert specific codes like &=um into a token called [CHA FILLER]. I then add these tokens to my tokenizer and resize the token embeddings, as in the sketch below.
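Roughly, the preprocessing looks like this (a sketch; the real code-to-token mapping is more extensive than shown here):

import re

CHAT_MAP = {
    r"&=\w+": "[CHA FILLER]",  # filled pauses such as &=um, &=uh
    r"\(\.\)": "[CHA PAUSE]",  # short-pause marker
}

def normalize_chat(text):
    for pattern, token in CHAT_MAP.items():
        text = re.sub(pattern, token, text)
    return text

print(normalize_chat("hi (.) &=um how are you ?"))
# hi [CHA PAUSE] [CHA FILLER] how are you ?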

I see, you added tokens to BERT. My experience with it is this: I fine-tuned it on a sentiment analysis task. In my case I didn’t add any tokens; I just used the model as is and got good results.

In your case, since you added tokens, I assume they were added to self.model.bert.embeddings.word_embeddings. To give more context, how many tokens did you add to the model? And would you mind giving me a sample of your data together with its labels? Just in case I can still help.

I added 5 or 6 tokens, but I’m afraid I cannot disclose the data used. I can give an example of the text, though. An example might be “Hi [CHA PAUSE] [CHA FILLER] how are you ? [CHA PAUSE]”. [CHA PAUSE] indicates that the user paused for a second, and [CHA FILLER] indicates that the user used a language filler such as “uh” or “uhm”. This should classify as, let’s say, 1 for the user being shy. I need these tokens for explainability reasons; they help a lot with LIME and other explainability methods.

You can refer to the function add_tokens(self) for the way I added them. Thank you so much for your time!

Got it. Looking at your input, it is mostly made up of the special tokens you added, so BERT might not be able to contextualize it. I think it might be better to take existing token representations from the BERT model itself and assign those values to your special tokens. For example:

Your [CHA PAUSE] seems similar in use to a comma “,” or to the word “paused”. You may want to use those BERT token embeddings.

You can assign the values this way.
Let’s say your [CHA PAUSE] special token is at index 30522 and, assuming you use the BERT-uncased variant, its “paused” token is at index 1007 (you can look the real ids up with tokenizer.convert_tokens_to_ids).

with torch.no_grad():  # needed: in-place writes to a leaf Parameter raise an autograd error otherwise
    self.model.bert.embeddings.word_embeddings.weight[30522] = self.model.bert.embeddings.word_embeddings.weight[1007]

You can do this with your other special tokens as well.
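For example, sketched for several tokens at once (the source words here are only guesses at close matches; make sure convert_tokens_to_ids does not return the [UNK] id for them):

token_map = {"[CHA PAUSE]": "paused", "[CHA FILLER]": "um"}
emb = self.model.bert.embeddings.word_embeddings
with torch.no_grad():
    for special, source in token_map.items():
        special_id = self.tokenizer.convert_tokens_to_ids(special)
        source_id = self.tokenizer.convert_tokens_to_ids(source)
        emb.weight[special_id] = emb.weight[source_id]  # copy the source row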
Also, since BERT uses WordPiece subword tokenization, you can try another approach: combine several word embeddings and take their column-wise average.
Let’s say you want to use the token representation of a group of tokens like “( user paused )” and assign it to your [CHA PAUSE]. You can do it this way:

with torch.no_grad():
    # tokenize the phrase; positions 0 and -1 are the [CLS]/[SEP] ids, so drop them
    inputs = tokenizer("( user paused )", return_tensors="pt")
    # average the word embeddings of the remaining tokens
    special_token = model.bert.embeddings.word_embeddings(inputs.input_ids[0, 1:-1]).mean(dim=0)
    # overwrite the special token's embedding row with the averaged vector
    model.bert.embeddings.word_embeddings.weight[30522] = special_token
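If you try it, it may be worth printing the subword split first, since WordPiece can break the phrase into pieces you don’t expect:

print(tokenizer.convert_ids_to_tokens(tokenizer("( user paused )").input_ids))
# e.g. ['[CLS]', '(', 'user', 'paused', ')', '[SEP]']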

No problem. Let me know if these methods help you. I’m happy to help.

Edit:
I’m not particularly sure whether the 2nd method will work, but I’m curious what the performance would be if you did it that way :) Regards
