Model predicts most frequent words

Hello, I'm trying to build a BERT-like model with nn.TransformerEncoder, but when I predict a masked word in a sequence, it usually outputs the most frequent words in the vocab.
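
Roughly what I have, as a minimal sketch (sizes, names, and the dummy data here are placeholders, not my exact code):

```python
import torch
import torch.nn as nn

# Minimal BERT-like masked LM built on nn.TransformerEncoder.
class TinyMaskedLM(nn.Module):
    def __init__(self, vocab_size=30000, d_model=256, nhead=4,
                 num_layers=4, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
        pos = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(pos)
        return self.lm_head(self.encoder(x))  # (batch, seq_len, vocab_size)

# Loss is computed only at masked positions: labels are -100 everywhere
# except where a token was masked, so ignore_index skips the rest.
model = TinyMaskedLM()
input_ids = torch.randint(0, 30000, (2, 16))
labels = torch.full_like(input_ids, -100)
labels[:, 5] = input_ids[:, 5]  # pretend position 5 was masked
logits = model(input_ids)
loss = nn.CrossEntropyLoss(ignore_index=-100)(
    logits.reshape(-1, logits.size(-1)), labels.reshape(-1)
)
```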

When I thought about it, it kind of made sense: predicting the most frequent class on imbalanced data gives high accuracy ‘for free’.

What can I do about this? I was thinking of changing the loss function.
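
For example, I was considering weighting CrossEntropyLoss by inverse token frequency, something like this sketch (token_counts is a placeholder here; it would really be counted from my training corpus):

```python
import torch
import torch.nn as nn

vocab_size = 30000
# Placeholder counts; in practice these come from the training data.
token_counts = torch.randint(1, 10000, (vocab_size,)).float()

# Down-weight frequent tokens: weight ~ 1 / frequency, rescaled so the
# mean weight stays around 1 and the overall loss scale doesn't change.
weights = 1.0 / token_counts
weights = weights * (vocab_size / weights.sum())

criterion = nn.CrossEntropyLoss(weight=weights, ignore_index=-100)
```

Is this a reasonable approach, or is there something better?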