I have been experimenting with different model architectures for dialogue modeling, and I am currently working with the Transformer. I am seeing a big difference between an RNN Seq2Seq model’s performance and the Transformer’s performance in both test_ppl and test_wer, even though both models have roughly the same number of parameters (~70 million). Is this expected, or is it strange? I am training on the Cornell Movie dialogue dataset just for experimentation; I will move to a larger dataset once I am satisfied that my setup works properly.
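For reference, here is a minimal sketch of how these two metrics are commonly defined. It assumes test_ppl is the exponential of the mean per-token negative log-likelihood and test_wer is the word-level edit distance divided by the reference length. The function names `perplexity` and `word_error_rate` are hypothetical, and the toolkit's actual implementation may differ (for example in tokenization or in how padding is handled).

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities of the reference tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Under these definitions, lower is better for both metrics. Note that WER measured against a single reference reply can stay high for dialogue even when the generated reply is reasonable, so the two numbers are worth reading together.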
Here are the graphs of each metric; red is the RNN and purple is the Transformer: