Tricks to improve model training

We have seen many tricks used in various papers without any particular explanation.

For example, in "Attention Is All You Need" the embeddings are multiplied by sqrt(embed_dim).
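To make the scaling mentioned above concrete, here is a minimal sketch in plain Python. The toy vocabulary size, the `embed` helper, and the initialisation with standard deviation embed_dim ** -0.5 are assumptions for illustration, not taken from the paper's code. The paper states the multiplication without giving a reason. One commonly suggested reading, which is an assumption here, is that it keeps the embeddings on a scale comparable to the positional encodings added afterwards.

```python
import math
import random


def embed(token_ids, table, embed_dim):
    """Look up each token's vector and multiply it by sqrt(embed_dim)."""
    scale = math.sqrt(embed_dim)
    return [[scale * x for x in table[t]] for t in token_ids]


# Hypothetical toy setup: 10 tokens in the vocabulary, embed_dim = 16.
random.seed(0)
embed_dim = 16
vocab_size = 10

# Assumed initialisation: Gaussian with std embed_dim ** -0.5.
table = [
    [random.gauss(0.0, embed_dim ** -0.5) for _ in range(embed_dim)]
    for _ in range(vocab_size)
]

# Two token ids in, two scaled vectors of length embed_dim out.
out = embed([3, 7], table, embed_dim)
```

With embed_dim = 16 every looked-up value is simply multiplied by 4.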

Are there any articles or notes that talk about the various techniques people have tried and the motivation behind these ad hoc tuning tricks?

Hi @thyr,

Well, there are some ‘general’ tricks for training. ‘Cosine learning rate decay’ is really just a smoother way of decaying the learning rate, and ‘learning rate warmup’ starts with a small learning rate and ramps it up, which prevents the unwanted ‘weight chaos’ caused by starting with too high a learning rate; the two are often used together. For quite deep models in particular, you may want to read about ‘residual blocks’. With respect to NLP, there is ‘attention’ for seq2seq models (as you have already mentioned) as well as ‘beam search’, for instance. I can recommend Andrew Ng’s material in good conscience for many of these methods. Depending on your task, you may also want to consider implementing ‘MTL’ (multi-task learning).
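As a rough illustration of warmup followed by cosine decay, here is a minimal sketch in plain Python. The function name `lr_at_step`, the linear shape of the warmup, and all the numbers are assumptions chosen for the example, not a reference implementation from any particular paper or library.

```python
import math


def lr_at_step(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate with linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Warmup: ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine decay: falls smoothly from base_lr to min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical schedule: 10 warmup steps out of 100 total, peak rate 1e-3.
schedule = [lr_at_step(s, 1e-3, 10, 100) for s in range(101)]
```

The rate starts small, reaches its peak at the end of warmup, and then falls along half a cosine wave instead of dropping in abrupt steps.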