As the title says, I wonder how widely this learning-rate scheduler is actually used.
Sometimes it seems to have a big impact on training transformer-based models.
But I can't find it implemented in torch, huggingface transformers, or tensorflow.
It's only implemented in allennlp or opennmt.
That seems a little bit weird to me.
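For reference, assuming the scheduler in question is the Noam schedule from "Attention Is All You Need" (the one AllenNLP and OpenNMT ship as `NoamLR` / `noam` decay), it is short enough to write by hand. Below is a minimal sketch of the formula; the function name, default `d_model`, and `warmup_steps` values are just illustrative assumptions:

```python
def noam_lr(step, d_model=512, warmup_steps=4000, factor=1.0):
    """Noam schedule: linear warmup, then decay proportional to step^-0.5.

    lr = factor * d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    The two branches intersect at step == warmup_steps, where the LR peaks.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return factor * d_model ** -0.5 * min(step ** -0.5,
                                          step * warmup_steps ** -1.5)
```

In PyTorch you can plug a function like this into `torch.optim.lr_scheduler.LambdaLR` (with the optimizer's base LR set to 1.0, since the function returns the absolute rate), which may be why no dedicated class exists in core torch.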
I am also interested in this. From googling, it seems that only a few papers use this scheduling policy.