I tried the language model sample from the PyTorch examples at https://github.com/pytorch/examples/tree/master/word_language_model.
The LSTM setting performs well: it reaches a perplexity of around 100, and the generated English sentences look reasonable. However, the Transformer setting performs poorly, with a perplexity of around 1000. Transformers are highly regarded these days, so these results look strange.
I may be missing something, but this example doesn't make sense if it is meant to show the viability of PyTorch's Transformer implementation. Are there any suggested changes to the hyperparameters (or anything else) that would let the Transformer model perform comparably with the RNNs?
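For reference, these are roughly the commands I ran, based on the invocations suggested in the example's README (the flag values are the README defaults, so treat them as assumptions rather than a tuned setup):

```shell
# LSTM run, using the example's default hyperparameters
python main.py --cuda --epochs 6 --model LSTM

# Transformer run, with the learning rate the README suggests for it
python main.py --cuda --epochs 6 --model Transformer --lr 5
```

If there are known-good flag combinations for `--model Transformer` (e.g. different `--lr`, `--nhead`, `--nlayers`, or `--dropout` values), I'd appreciate pointers.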