Advice for training a Transformer for translation

I am training a standard Transformer (6 encoder blocks, 6 decoder blocks, d_model=512, heads=8, dropout=0.1) on the Helsinki-NLP/opus_books EN-ES dataset from Hugging Face. I am using the Adam optimizer with a learning rate of 1e-4 and a batch size of 16. After about 30 epochs, my loss is down to ~2.0 using nn.CrossEntropyLoss with label smoothing set to 0.1. I have experience training MLPs but not transformers, and I was wondering if anybody could suggest things I could try to further decrease the loss and accelerate convergence.
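
For concreteness, here is roughly what my optimizer and loss setup looks like. The nn.Transformer constructor and PAD_IDX value below are stand-ins so the snippet runs on its own; my actual model is the from-scratch implementation mentioned at the end.

```python
import torch
import torch.nn as nn

# Stand-in for my from-scratch model, using the same shape hyperparameters
# described above; nn.Transformer is here only so this snippet is self-contained.
model = nn.Transformer(
    d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, dropout=0.1
)

# PAD_IDX is whatever id my tokenizer assigns to the padding token (0 is illustrative).
PAD_IDX = 0

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Padding tokens are excluded from the loss; label smoothing as described above.
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_IDX, label_smoothing=0.1)

# Per step: logits are (batch, seq_len, vocab_size), targets are (batch, seq_len), so
#   loss = loss_fn(logits.view(-1, logits.size(-1)), targets.view(-1))
```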

Some ideas I’m considering:

  • Switching to AdamW
  • Using a cosine learning rate scheduler (rough sketch of these first two after the list)
  • Increasing the number of attention heads
  • Increasing d_model
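
For the first two bullets, something like this is what I have in mind. The weight decay value, step counts, and the placeholder model are illustrative only, not things I have tested:

```python
import torch
import torch.nn as nn

# Tiny placeholder standing in for my transformer, just so this snippet runs.
model = nn.Linear(512, 512)

# AdamW decouples weight decay from the gradient update; 0.01 is a guess, not tuned.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# Cosine decay over the total number of optimizer steps I plan to train for.
steps_per_epoch = 1000                 # placeholder: len(train_loader) in my real setup
total_steps = steps_per_epoch * 30     # 30 epochs, as in my current run
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

# In the training loop, called once per batch right after optimizer.step():
#   scheduler.step()
```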

Note: I am following Umar Jamil's from-scratch implementation of the Transformer in PyTorch, not PyTorch's built-in nn.Transformer.