Transformer Classification (Overfitting ?)

I have built a simple transformer classifier consisting of multiple TransformerEncoderLayers.
I have tried a variety of different values for hyperparameters (model dimension, number of heads, hidden size, number of layers, initial learning rate and batch size). I have also tried different dropout probabilities (0.2, 0.5 and 0.75). My vocabulary size is 100,000.
All different combinations have lead to a following loss pattern:loss_graph_0

blue graph: training loss
orange graph: validation loss

Of course, they differ by a tiny amount, but the overall result has always been the same.
I have also tried another Dataset to make sure the data is not the problem, but the results were even worse.
The training loss just stays at about the same value from the 15th unit on until the next epoch starts from unit 22 on, where the training loss suddenly decreases by about 0.02, but then stays the same again for the entire epoch.
The validation loss does not change significantly from unit 15 on.
It looks very much like overfitting, but it seems quite strange too me, that the model starts overfitting that early with such a bad loss.

I appreciate any hints or help, if somebody has already experienced something like this.