Transformer learning gets worse with more encoder blocks

I am following this tutorial and have adapted it slightly to my data:

In my data, each input is a sequence of 80 letters (amino acid codes) drawn from a vocabulary of 21 symbols (20 amino acids plus one padding letter), and the output is a single continuous number. I dropped the masking and changed the decoder to output a single value instead of ntokens values.

The parameters that I used:
ntokens = 22 # size of vocabulary, needs to be even for some reason
emsize = 30 # embedding dimension
d_hid = 256 # dimension of the feedforward network model in nn.TransformerEncoder
nlayers = 2 # number of nn.TransformerEncoderLayer in nn.TransformerEncoder
nhead = 2 # number of heads in nn.MultiheadAttention
dropout = 0.1 # dropout probability
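
For reference, here is a minimal sketch of the kind of model described above, using the parameters listed. The class name `SequenceRegressor`, the mean pooling over sequence positions, and the random input tokens are assumptions made for illustration; they are not taken from the tutorial, and the actual code may reduce the per-position outputs differently.

```python
import math
import torch
from torch import nn


class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding (this form needs an even d_model)."""

    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: [seq_len, batch, d_model]
        return self.dropout(x + self.pe[: x.size(0)])


class SequenceRegressor(nn.Module):
    """Hypothetical name: encoder-only transformer with a scalar output."""

    def __init__(self, ntoken, d_model, nhead, d_hid, nlayers, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(ntoken, d_model)
        self.pos_encoder = PositionalEncoding(d_model, dropout)
        layer = nn.TransformerEncoderLayer(d_model, nhead, d_hid, dropout)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.head = nn.Linear(d_model, 1)  # one value instead of ntokens

    def forward(self, src):
        # src: [seq_len, batch] integer token ids; no attention mask is used
        x = self.embedding(src) * math.sqrt(self.d_model)
        x = self.encoder(self.pos_encoder(x))
        x = x.mean(dim=0)  # assumption: average over the 80 positions
        return self.head(x).squeeze(-1)  # [batch]


model = SequenceRegressor(
    ntoken=22, d_model=30, nhead=2, d_hid=256, nlayers=2, dropout=0.1
)
tokens = torch.randint(0, 22, (80, 4))  # 4 random sequences of length 80
pred = model(tokens)
```

Changing `nlayers` in the constructor call is the only difference between the 2-layer and 4-layer runs discussed below.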

I have around 1 million data points, which is not that much. When I train this network I get a relatively nice loss curve, and the MSE converges to about 1.5.

However, when I increase nlayers (I read in other tutorials that it should be around 6), the network fails to train. Even with 4 layers, the MSE loss stays the same for 100 epochs.

What am I missing? I thought that with a small amount of data the network should overfit, giving a very low training loss and a poor validation/test loss, but this is not the case.
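
One way to test the overfitting expectation directly is to train on a single small batch and watch whether the training loss falls at each depth. The sketch below is only an illustration under stated assumptions: the function name `overfit_small_batch`, the random tokens and targets, the Adam optimizer with `lr=1e-3`, and the mean pooling are all hypothetical choices, and positional encoding is left out for brevity.

```python
import torch
from torch import nn


def overfit_small_batch(nlayers, steps=20, seed=0):
    """Train on one fixed random batch and return the per-step MSE losses."""
    torch.manual_seed(seed)
    ntokens, emsize, nhead, d_hid = 22, 30, 2, 256

    embed = nn.Embedding(ntokens, emsize)
    # dropout disabled so the loss curve reflects fitting capacity only
    layer = nn.TransformerEncoderLayer(emsize, nhead, d_hid, dropout=0.0)
    encoder = nn.TransformerEncoder(layer, nlayers)
    head = nn.Linear(emsize, 1)

    params = (
        list(embed.parameters())
        + list(encoder.parameters())
        + list(head.parameters())
    )
    opt = torch.optim.Adam(params, lr=1e-3)  # assumed optimizer and rate

    x = torch.randint(0, ntokens, (80, 16))  # 16 random sequences, length 80
    y = torch.randn(16)                      # random regression targets

    losses = []
    for _ in range(steps):
        opt.zero_grad()
        pred = head(encoder(embed(x)).mean(dim=0)).squeeze(-1)
        loss = nn.functional.mse_loss(pred, y)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses


shallow = overfit_small_batch(nlayers=2, steps=5)
deep = overfit_small_batch(nlayers=4, steps=5)
```

Comparing `shallow` and `deep` over more steps would show whether the deeper stack can memorize even a tiny batch, or whether its loss stays flat as in the full training run.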