RNN outperforming Transformer for dialogue task

I have been experimenting with different model architectures for dialogue modeling, and I am currently working with the Transformer. I am seeing a large gap between an RNN Seq2Seq model and the Transformer on both test_ppl and test_wer, with the RNN coming out ahead. Both models use roughly the same number of parameters (~70 million). Is this expected, or is it strange? I am training on the Cornell Movie-Dialogs Corpus just for experimentation; I will move to a larger dataset once I am satisfied that my setup works properly.
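For what it's worth, this is roughly how I am checking that the two models land near the same parameter budget (a sketch only; the vocabulary size and layer sizes below are placeholders, not my exact configuration):

```python
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    """Total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

VOCAB_SIZE = 32_000  # placeholder vocab size
D_MODEL = 512

# Rough Transformer encoder-decoder (hyperparameters are placeholders)
transformer = nn.ModuleDict({
    "embed": nn.Embedding(VOCAB_SIZE, D_MODEL),
    "core": nn.Transformer(d_model=D_MODEL, nhead=8,
                           num_encoder_layers=6, num_decoder_layers=6,
                           dim_feedforward=2048),
    "proj": nn.Linear(D_MODEL, VOCAB_SIZE),
})

# Rough GRU Seq2Seq of comparable size (hyperparameters are placeholders)
rnn_seq2seq = nn.ModuleDict({
    "embed": nn.Embedding(VOCAB_SIZE, D_MODEL),
    "encoder": nn.GRU(D_MODEL, 1024, num_layers=2, batch_first=True),
    "decoder": nn.GRU(D_MODEL, 1024, num_layers=2, batch_first=True),
    "proj": nn.Linear(1024, VOCAB_SIZE),
})

print(f"Transformer params: {count_params(transformer) / 1e6:.1f}M")
print(f"RNN Seq2Seq params: {count_params(rnn_seq2seq) / 1e6:.1f}M")
```

With placeholder sizes like these, both counts come out in the same ballpark, which is what I mean by "roughly the same number of parameters".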

Here are the graphs for each metric; red is the RNN and purple is the Transformer:

Train PPL:

Test PPL:

Test WER:
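To be clear about what the curves show: by PPL I mean exp of the mean token-level cross-entropy (ignoring padding), and by WER the word-level edit distance against the reference reply divided by the reference length. A minimal sketch of both, with the pad id and tokenization as assumptions:

```python
import math
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor, pad_id: int = 0) -> float:
    """exp of the mean token-level cross-entropy, ignoring padding.

    logits:  (batch, seq_len, vocab_size)
    targets: (batch, seq_len)
    """
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,
    )
    return math.exp(loss.item())

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard edit-distance DP over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, sub)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```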