For the last couple of days, I have been struggling with code that trains the French version of BERT (CamemBERT) to classify tweets (over 60,000) into 10 different classes. My biggest problem is that the code below does not raise any errors. It works very well with other datasets, where it gives higher results after only 5 epochs. With my dataset, however, it stays stuck at an F1 score of 0.5 even after 10 epochs, although the two datasets have the same structure, which is bizarre!

After mounting my drive and installing all the needed packages, here is my code:

df = pd.read_csv('ALL.csv')
   commentaire                                        classement
0  Nul à chier                                        Hate
2  il faut arreter de faire des videos clash roya…    Hate
3  9 ans ou pas le branleur je me fous en slip ,j…    Hate
4  Nul                                                Hate

# convert all the labels to ints
possible_labels = df.classement.unique()

label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index


Neutral              50000
Hate                  3000
Homophobia            3000
Mockery               3000
Racism                3000
Troll                 3000
Moral Harassment      3000
Sexual Harassment     3000
Threat                3000
Insult                3000
Name: classement, dtype: int64

#split data : training \ test

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df.index.values, 

df['data_type'] = ['not_set']*df.shape[0]

df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

df.groupby(['classement', 'label', 'data_type']).count()

{'Hate': 0,
'Homophobia': 1,
'Insult': 2,
'Mockery': 3,
'Moral Harassment': 5,
'Neutral': 6,
'Racism': 7,
'Sexual Harassment': 8,
'Threat': 9,
'Troll': 10,
nan: 4}
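The `nan: 4` entry in this mapping suggests that some rows have no class and that the missing value was given an id of its own, which turns it into an eleventh class. Below is a minimal sketch, in plain Python, of building the mapping while skipping missing values. The sample labels and the helper names `is_missing` and `build_label_dict` are made up for illustration and are not part of the posted code.

```python
import math


def is_missing(value):
    """True for None or a float NaN (how pandas represents an empty cell)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def build_label_dict(labels):
    """Map each distinct, non-missing label to an integer id in first-seen order."""
    label_dict = {}
    for label in labels:
        if is_missing(label):
            continue  # do not give missing values their own class id
        if label not in label_dict:
            label_dict[label] = len(label_dict)
    return label_dict


# Hypothetical sample, including one row with no class.
sample = ["Hate", "Neutral", float("nan"), "Insult", "Neutral"]
label_dict = build_label_dict(sample)
print(label_dict)  # {'Hate': 0, 'Neutral': 1, 'Insult': 2}
```

With pandas, dropping those rows before building the mapping, for example with `df.dropna(subset=['classement'])`, should have the same effect.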

# I think the problem comes from here
# I tried to convert the data into lists
msgTrain = df[df.data_type=='train'].commentaire.astype(str).values.tolist()
msgVal = df[df.data_type=='val'].commentaire.astype(str).values.tolist()

tokenizer = CamembertTokenizer.from_pretrained('camembert-base', do_lower_case=True)

encoded_data_train = tokenizer.batch_encode_plus(msgTrain,
                                                 return_attention_mask = True,
                                                 return_tensors = 'pt'

encoded_data_val = tokenizer.batch_encode_plus(msgVal, 
                                                 return_attention_mask = True,
                                                 return_tensors = 'pt'

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)

dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)

model = CamembertForSequenceClassification.from_pretrained("camembert-base",

from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 32

dataloader_train = DataLoader(dataset_train,

dataloader_validation = DataLoader(dataset_val, 

from transformers import AdamW, get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(),

epochs = 10

scheduler = get_linear_schedule_with_warmup(optimizer, 

from sklearn.metrics import f1_score

def f1_score_func(preds, labels):#F1
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

def accuracy_per_class(preds, labels):  # per-class accuracy
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()#all class in one vec

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')
import random

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 
model.to(device)

def evaluate(dataloader_val):

    loss_val_total = 0
    predictions, true_vals = [], []

    for batch in dataloader_val:
        batch = tuple(b.to(device) for b in batch)
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():       
            outputs = model(**inputs)
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

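        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        # collect the per-batch results so they can be concatenated below
        predictions.append(logits)
        true_vals.append(label_ids)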
    loss_val_avg = loss_val_total/len(dataloader_val) 
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
    return loss_val_avg, predictions, true_vals

for epoch in tqdm(range(1, epochs+1)):
    loss_train_total = 0

    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)

    for batch in progress_bar:

        batch = tuple(b.to(device) for b in batch)
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

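        model.zero_grad()  # clear gradients left over from the previous batch
        outputs = model(**inputs)
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()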

        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item()/len(batch))})

    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')  # save the model state for this epoch

    tqdm.write(f'\nEpoch {epoch}')#num epoch
    loss_train_avg = loss_train_total/len(dataloader_train)          
    tqdm.write(f'Training loss: {loss_train_avg}')
    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}')

The results are the same in every epoch, like this:

That is all. Can anyone help me, please? Maybe I missed something important!

Could you try printing out the set of classes that are predicted by your network? That is, out of the 10 (or 11, if you count “nan” as a valid class) classes in your input, see how many the net reports as having seen. It is possible that your network is predicting only one class, or a couple of classes.
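One way to run this check is to take the argmax of each row of logits and count how often each class index comes out. The sketch below uses plain Python lists and made-up logits for 4 examples and 3 classes; the helper name `predicted_class_counts` is hypothetical.

```python
from collections import Counter


def predicted_class_counts(logits):
    """Count how often each class index wins the argmax over a batch of logits."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return Counter(preds)


# Hypothetical logits: 4 examples, 3 classes.
logits = [
    [0.1, 2.3, -1.0],
    [0.0, 1.9, 0.2],
    [1.5, 1.6, 0.3],
    [0.2, 3.1, 0.1],
]
counts = predicted_class_counts(logits)
print(counts)          # Counter({1: 4}): only class 1 is ever predicted
print(sorted(counts))  # [1]
```

With the `predictions` array returned by `evaluate` in the posted code, `np.unique(np.argmax(predictions, axis=1), return_counts=True)` should give the same information.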

Also: I seem to remember that the F1-score applies only to binary classification? But your input has more than two classes?

It got even worse! I used my validation set to see which classes are predicted. I found that the net did not predict any of them except for the Neutral one (all of its examples! Which makes it even harder to understand!). For the others I got just 0s!

As for the F1 score, it is just calculated from the precision and recall. Yes, I have 10 classes (11 if you count the nan!)
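For more than two classes, F1 is computed per class and then averaged. The `average='weighted'` option used in the posted `f1_score` call weights each class by its number of true examples. The sketch below recomputes that average by hand for a model that always answers Neutral. It uses the class counts listed in the question, scaled down by 1000, and assumes the validation split keeps the same proportions, which the thread does not confirm. The result is about 0.51, close to the plateau described in the question.

```python
def weighted_f1(true_labels, pred_labels):
    """Per-class F1, averaged with weights equal to each class's true count."""
    pairs = list(zip(true_labels, pred_labels))
    total = len(true_labels)
    score = 0.0
    for cls in set(true_labels):
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * (tp + fn) / total  # weight by the class's support
    return score


# Class counts from the question, divided by 1000 (assumed proportions).
counts = {
    "Neutral": 50, "Hate": 3, "Homophobia": 3, "Mockery": 3, "Racism": 3,
    "Troll": 3, "Moral Harassment": 3, "Sexual Harassment": 3,
    "Threat": 3, "Insult": 3,
}
true_labels = [name for name, n in counts.items() for _ in range(n)]
pred_labels = ["Neutral"] * len(true_labels)  # a model that only predicts Neutral

score = weighted_f1(true_labels, pred_labels)
print(round(score, 3))  # 0.511
```

So a weighted F1 near 0.5 on data this imbalanced is compatible with a network that has collapsed onto the majority class.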

This is what I suspected, from the behaviour that you described in your post.

I am not familiar with BERT, but here are a couple of things you could try to fix this:

  • See if you are missing some normalization step in input processing. Specifically, a normalization step that makes all the input numbers small. An example of such a step from computer vision would be: dividing all pixel values by 255 so that they are all at most 1.
  • More generally: many networks work best when we scale the input features so that they are all within a comparable range. Is there such a step involved in training BERT? If so, have you done this step?
  • Reduce the initial learning rate till your net starts finding more than one class. From your code above you have lr set to 1e-3. Try reducing it to 1e-4, then 1e-5, and so on, till more classes start appearing in the predictions. Finding the right learning rate may require some trial and error.
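When lowering the rate, it helps to keep in mind that the posted code also wraps the optimizer in a linear warmup schedule, so the rate actually applied changes at every step. The sketch below reproduces the general shape of such a schedule in plain Python. It is an illustration, not the library's code, and `base_lr`, `total_steps` and `warmup_steps` are assumed values that do not come from the thread. Rates in the 2e-5 to 5e-5 range are commonly used for fine-tuning BERT-style models.

```python
def linear_schedule_with_warmup(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp from 0 to 1, then linear decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 2e-5       # assumed starting point, not a value from the thread
total_steps = 1000   # hypothetical: batches per epoch * number of epochs
warmup_steps = 100   # hypothetical

for step in (0, 50, 100, 550, 1000):
    multiplier = linear_schedule_with_warmup(step, warmup_steps, total_steps)
    print(step, base_lr * multiplier)
```

The multiplier is 0 at the first step, reaches 1 at the end of warmup, and falls back to 0 at the last step, so the peak rate is the value passed to the optimizer.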

Edited to add: From the denominator it looks like all your validation instances have their (correct/ground truth) label set to “Neutral”. This does not look like a good thing for multiclass validation. Usually we want the validation set to have all classes in the same proportion (as far as possible) as the training set.
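Keeping the class proportions equal across the two sets is usually called a stratified split. Below is a minimal sketch in plain Python. The helper name `stratified_split`, the validation fraction, the seed and the sample labels are all made up for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, val_fraction=0.15, seed=17):
    """Return (train_idx, val_idx) with every class split in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = max(1, round(len(indices) * val_fraction))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Hypothetical imbalanced labels.
labels = ["Neutral"] * 100 + ["Hate"] * 20 + ["Insult"] * 20
train_idx, val_idx = stratified_split(labels)
print(Counter(labels[i] for i in val_idx))
```

scikit-learn's `train_test_split` accepts a `stratify=` argument that does this in one call, so passing the label column there is likely the simplest route in the posted code.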

I am not familiar with BERT, but here are a couple of things you could try to fix this:

  • See if you are missing some normalization steps in input processing. Specifically, a normalization step that makes all the input numbers small. An example of such a step from computer vision would be: dividing all pixel values by 255 so that they are all at most 1.

I really appreciate your help. Thanks!!
Indeed, I did not do any normalization or preprocessing on my data. I just take the tweets as they are on Twitter, with their emojis and all the characters that you can find in comments! But I don’t think that is the problem, is it? CamemBERT was trained on 148GB of text! Well, I am not sure, but that is what I assumed!

  • More generally: many networks work best when we scale the input features so that they are all within a comparable range. Is there such a step involved in training BERT? If so, have you done this step?

Yes, there is. I already put it in the parameters of the model:

encoded_data_train = tokenizer.batch_encode_plus(msgTrain,
                                                 return_attention_mask = True,
                                                 max_length=50,  # here it is; it can be at most 512, and I already changed it
                                                                 # many times, but a larger value makes the model train very slowly
                                                 return_tensors = 'pt'

  • Reduce the initial learning rate till your net starts finding more than one class. From your code above you have lr set to 1e-3. Try reducing it to 1e-4, then 1e-5, and so on, till more classes start appearing in the predictions. Finding the right learning rate may require some trial and error.

I also did this 🙄 but I got the same results!

Edited to add: From the denominator it looks like all your validation instances have their (correct/ground truth) label set to “Neutral”. This does not look like a good thing for multiclass validation. Usually we want the validation set to have all classes in the same proportion (as far as possible) as the training set.

I am not sure what you mean. Can you please explain more?
Thank you very much!!

As I said, I am not familiar with how BERT models work. The model may expect that you normalize the input in some manner. The model having been trained on large amounts of data does not mean that you don’t have to preprocess the input. For example, every pre-trained model from TorchVision expects that its input is normalized in a very particular way:

All pre-trained models expect input images normalized in the same way, 
i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where 
H and W are expected to be at least 224. The images have to be loaded
in to a range of [0, 1] and then normalized using 
`mean = [0.485, 0.456, 0.406]` and `std = [0.229, 0.224, 0.225]`.

Many of these vision models are trained on much more than 148GB of images. Camembert may also expect some such input normalization. You have to check its documentation to see if this is the case.
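For text models, the closest counterpart to pixel normalization is usually handled by the tokenizer: the text is turned into token ids, every sequence in a batch is brought to the same length, and an attention mask marks which positions are padding. The sketch below illustrates only the padding and masking part in plain Python. The token ids and `PAD_ID` are placeholders (the real values come from the tokenizer), and the truncation here is naive: it simply cuts the sequence, including any end-of-sequence token.

```python
def pad_and_mask(sequences, max_length, pad_id):
    """Truncate or right-pad each id sequence to max_length and build its mask."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = list(seq[:max_length])
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


PAD_ID = 1  # placeholder; take the real value from the tokenizer
batch = [[5, 121, 87, 6], [5, 42, 6], [5, 9, 10, 11, 12, 13, 6]]  # made-up ids
ids, mask = pad_and_mask(batch, max_length=5, pad_id=PAD_ID)
for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

In recent versions of the transformers library, the tokenizer does this itself when called with the `padding` and `truncation` arguments, so it is worth checking that the posted `batch_encode_plus` call requests both.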

My remark about the validation labels was a mistake: I misread the second 7501 in:

Class: Neutral
Accuracy: 7501/7501

After looking over the code which prints this, I think this should be fine (provided the logic there is OK; I didn’t look too closely).

Given that Camembert is a well-regarded pre-trained model, my current hypothesis is that the issue is with how you feed it your input. Most likely, your code fails to enforce some requirement on the input. So I would suggest that you go carefully through the documentation for Camembert, and compare your input code with other known successful examples of applying Camembert.

Solved!! After all, I had just passed the wrong type to the encoder here:

msgTrain = df[df.data_type=='train'].commentaire.astype(str).values.tolist()
msgVal = df[df.data_type=='val'].commentaire.astype(str).values.tolist()

It shouldn’t be a list!!

I am not quite sure what is happening, but … it worked!!

Thank you so much!

