Transformer doesn't fit English-to-English dialogue

I’m new to transformers, so maybe this is a dumb question, but I’ve been trying to get a standard Seq2Seq transformer model to fit a few examples. The examples I’m using are English-to-English conversations, but I don’t think that matters much. Obviously the model will be hopelessly overfit, but for some reason I can’t get larger models (>30 million parameters) to fit even a few examples.

I’d assume a larger model would have a better ability to fit (overfit, in this scenario), so I’m not sure what’s going on. Maybe I’m thinking about this wrong; please correct me if so.

The expected behavior is that the loss trains down to a very low value and the model reproduces the training examples almost perfectly. That is not what happens: the loss gets stuck around 0.2, and the model ends up producing the same output for every input.

I wrote the Transformer model myself; the code is about 350 lines, so a bit much to post in full here, but I believe the same effect can be reproduced with the standard PyTorch TransformerEncoder and TransformerDecoder. Here is a link to the entire file I’ve used:

First, I would suggest starting with a model with fewer parameters to prove the idea. Seeing preliminary results early saves time compared to training a complex model up front.

Second, IMO it would be better to start with the Transformer/MultiheadAttention modules in the PyTorch library, since they are developed and maintained by many people. That should rule out implementation mistakes during these proof-of-concept tests.
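For what it's worth, here is a minimal sketch of that kind of small-scale sanity check using the stock `nn.Transformer`. All sizes, learning rate, and data below are placeholders (not from your setup); the point is only that a tiny model memorizing four fake pairs should drive the loss toward zero if the training loop is wired correctly:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder sizes: a tiny model overfitting a handful of examples.
vocab_size, d_model = 100, 64
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       dim_feedforward=128, batch_first=True)
embed = nn.Embedding(vocab_size, d_model)
out_proj = nn.Linear(d_model, vocab_size)

# Four fake (src, tgt) pairs of length 5 (stand-ins for real token ids).
src = torch.randint(0, vocab_size, (4, 5))
tgt = torch.randint(0, vocab_size, (4, 5))

params = (list(model.parameters()) + list(embed.parameters())
          + list(out_proj.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Causal mask over the 4 decoder input positions (tgt minus last token).
L = tgt.size(1) - 1
causal = torch.triu(torch.full((L, L), float('-inf')), diagonal=1)

for step in range(300):
    opt.zero_grad()
    # Teacher forcing: decoder sees tgt[:, :-1] and predicts tgt[:, 1:].
    logits = out_proj(model(embed(src), embed(tgt[:, :-1]), tgt_mask=causal))
    loss = loss_fn(logits.reshape(-1, vocab_size), tgt[:, 1:].reshape(-1))
    loss.backward()
    opt.step()

print(loss.item())  # should be far below the ~4.6 random-guess baseline
```

If even this doesn't memorize, the problem is in the training loop (masking, target shifting, loss reshaping) rather than in the model size.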

I’ve tried scaling it down; I’m now training on only four one-word examples, and I think there has to be an error in my code.

I’ve also switched to the PyTorch TransformerEncoder and TransformerDecoder, and the same effect still happens.
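Since I can't post the whole file, here is a sketch of how I understand the usual encoder/decoder wiring (placeholder sizes and fake data, not my actual code). I'm double-checking the two things I've read most often cause a constant output: forgetting the causal `tgt_mask`, and feeding the decoder unshifted targets:

```python
import torch
import torch.nn as nn

d_model, vocab = 64, 100  # placeholder sizes
enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2)
dec = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2)
embed = nn.Embedding(vocab, d_model)
proj = nn.Linear(d_model, vocab)

src = torch.randint(0, vocab, (2, 7))  # (batch, src_len), fake token ids
tgt = torch.randint(0, vocab, (2, 5))  # (batch, tgt_len), fake token ids

memory = enc(embed(src))
tgt_in = tgt[:, :-1]                   # decoder input is the shifted target
L = tgt_in.size(1)
# Upper-triangular -inf mask so position i cannot attend to positions > i.
causal = torch.triu(torch.full((L, L), float('-inf')), diagonal=1)
logits = proj(dec(embed(tgt_in), memory, tgt_mask=causal))
print(logits.shape)  # torch.Size([2, 4, 100])
```

The loss would then compare `logits` against `tgt[:, 1:]`, i.e. each position predicts the next token.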

Here is the new file.
The train.from and train.to files are just text files where each line is a new example.