Well, there are some ‘general’ tricks for training, such as ‘cosine learning rate decay’, which is really just a smoother decay schedule, and ‘learning rate warmup’, which prevents the chaotic weight updates you get from starting with a too-high learning rate; the two are commonly combined. For very deep models in particular, you may want to read about ‘residual blocks’ (https://arxiv.org/pdf/1512.03385.pdf). With respect to NLP, there is ‘attention’ for seq2seq models (as you have already mentioned) as well as ‘beam search’, for instance. I can recommend ‘Andrew Ng’ with good conscience for many of these methods. Depending on your task, you may also want to consider implementing ‘MTL’ (multi-task learning).
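To make the warmup + cosine decay idea concrete, here is a minimal sketch of such a schedule in plain Python; the step counts and learning-rate values (`warmup_steps`, `peak_lr`, `min_lr`) are illustrative assumptions, not values from any particular recipe:

```python
import math

def lr_schedule(step, total_steps, warmup_steps=100, peak_lr=1e-3, min_lr=1e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr.
    All parameter values here are illustrative placeholders."""
    if step < warmup_steps:
        # linear warmup: ramp up from ~0 to peak_lr over warmup_steps
        return peak_lr * (step + 1) / warmup_steps
    # cosine decay: progress goes 0 -> 1 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# sample the schedule at a few points
for s in (0, 99, 500, 1000):
    print(s, lr_schedule(s, total_steps=1000))
```

In a framework like PyTorch you would typically use a built-in scheduler instead, but the shape of the curve is exactly this: a short ramp followed by a smooth half-cosine down.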
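Beam search for seq2seq decoding can be sketched in a few lines: instead of greedily taking the single best token at each step, keep the `beam_width` highest-scoring partial sequences. The `step_fn` interface below (returning `(token, log_prob)` continuations, empty list when finished) is a hypothetical stand-in for a real decoder:

```python
import math

def beam_search(step_fn, start, beam_width=3, max_len=5):
    """Keep the beam_width best partial sequences by cumulative log-prob.
    step_fn(seq) -> list of (token, log_prob) continuations; an empty
    list marks the sequence as finished. (Hypothetical interface.)"""
    beams = [(0.0, [start])]  # (cumulative log-prob, sequence)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            expansions = step_fn(seq)
            if not expansions:               # finished: carry over unchanged
                candidates.append((score, seq))
                continue
            for tok, logp in expansions:
                candidates.append((score + logp, seq + [tok]))
        # prune to the top beam_width hypotheses
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams

# toy decoder: two tokens with fixed probabilities, sequences end at length 3
def toy_step(seq):
    if len(seq) >= 3:
        return []
    return [("a", math.log(0.6)), ("b", math.log(0.4))]

best = beam_search(toy_step, "<s>", beam_width=2)[0]
```

Greedy decoding is the special case `beam_width=1`; wider beams trade decoding time for a better chance of finding a high-probability sequence overall.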