Is Noam scheduling widely used for training transformer-based models?

As in the title, I wonder how widely it is used.
Sometimes it seems to have a big impact on training transformer-based models.
But I can't find it implemented in torch, huggingface transformers, or tensorflow.
It only seems to be implemented in allennlp and opennmt.
That seems a little bit weird to me.
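For context, the schedule itself is simple enough to express on top of `torch.optim.lr_scheduler.LambdaLR`, which may be why the bigger libraries don't ship it as a named class. A minimal sketch (the `d_model=512` and `warmup_steps=4000` values are just the defaults from the original "Attention Is All You Need" paper, not anything specific to this question):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def noam_lambda(d_model: int, warmup_steps: int):
    # Noam factor: d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    # Linear warmup for warmup_steps, then inverse-square-root decay.
    def fn(step: int) -> float:
        step = max(step, 1)  # avoid division by zero at step 0
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
    return fn

model = torch.nn.Linear(512, 512)  # placeholder model for illustration
# base lr = 1.0 so the effective lr equals the Noam factor itself
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = LambdaLR(optimizer, lr_lambda=noam_lambda(d_model=512, warmup_steps=4000))

for step in range(10):
    optimizer.step()
    scheduler.step()
```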


I am also interested in this. From some googling, it seems that few papers use this scheduling policy.

Do you mean this? fairseq.optim.lr_scheduler.inverse_square_root_schedule — fairseq 0.12.2 documentation