We occasionally come across tricks that are used in various papers without any particular explanation.
For example, in 'Attention Is All You Need' the embeddings are multiplied by sqrt(embed_dim).
Are there any articles or notes that discuss the various techniques people have tried and the motivation behind these ad hoc tuning tricks?
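For concreteness, here is a minimal sketch of the scaling trick I mean (PyTorch-style; the class name and layout are my own, not from the paper):

```python
import math

import torch
import torch.nn as nn


class ScaledEmbedding(nn.Module):
    """Token embedding scaled by sqrt(d_model), as in 'Attention Is All You Need'."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # The paper multiplies the embedding output by sqrt(d_model) without much
        # explanation; one common reading is that it keeps the embedding magnitude
        # comparable to the (unscaled) sinusoidal positional encodings added afterwards.
        return self.embed(token_ids) * math.sqrt(self.d_model)
```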
Well, there are some general tricks for training, such as cosine learning rate decay, which is really just a smoother decay schedule, and learning rate warmup, which prevents the 'weight chaos' that can result from starting with too high a learning rate; the two are often used together. For quite deep models in particular, you may want to read about residual blocks (https://arxiv.org/pdf/1512.03385.pdf). With respect to NLP, there is attention for seq2seq models (which you already mentioned) as well as beam search, for instance. I can recommend Andrew Ng's courses in good conscience for many of these methods. Depending on your task, you may also want to consider multi-task learning (MTL).
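To illustrate warmup plus cosine decay, here is a rough sketch of the schedule as a plain function (the base LR and step counts are made-up illustrative values, not recommendations):

```python
import math


def lr_at_step(step: int,
               base_lr: float = 1e-3,
               warmup_steps: int = 1_000,
               total_steps: int = 100_000) -> float:
    """Linear warmup to base_lr, then cosine decay down to zero."""
    if step < warmup_steps:
        # Warmup: ramp the LR up linearly so the earliest updates
        # don't throw the freshly initialized weights into chaos.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay: smoothly anneal from base_lr toward 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In a PyTorch training loop you could wrap this in `torch.optim.lr_scheduler.LambdaLR` (dividing by `base_lr`, since that scheduler expects a multiplier) or simply assign the returned value to each parameter group's `lr` every step.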