I’m trying to recreate the learning rate schedules used in BERT/RoBERTa, which set up a particular optimizer with specific arguments, linearly warm the learning rate up to a peak value, and then decay it at a specific rate.
Say that I am trying to reproduce the RoBERTa pretraining setup, described below:
BERT is optimized with Adam (Kingma and Ba, 2015) using the following parameters: β1 = 0.9, β2 = 0.999, ε = 1e-6 and L2 weight decay of 0.01. The learning rate is warmed up over the first 10,000 steps to a peak value of 1e-4, and then linearly decayed. BERT trains with a dropout of 0.1 on all layers and attention weights, and a GELU activation function (Hendrycks and Gimpel, 2016). Models are pretrained for S = 1,000,000 updates, with minibatches containing B = 256 sequences of maximum length T = 512 tokens.
I would need to start with a very small learning rate (say 1e-6), warm up to 1e-4 over the first 10,000 steps, and then linearly decay the learning rate for the rest of training.
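For the optimizer itself, here is roughly what I have so far. I'm assuming torch.optim.AdamW is the right way to get the "L2 weight decay of 0.01", and the nn.Linear model is just a placeholder:

```python
import torch

# Placeholder model just for illustration; substitute the actual model.
model = torch.nn.Linear(768, 768)

# Adam with the hyperparameters from the quoted passage. Assuming AdamW
# (decoupled weight decay) is close enough to the paper's "L2 weight decay".
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,              # peak learning rate; the scheduler scales this
    betas=(0.9, 0.999),
    eps=1e-6,
    weight_decay=0.01,
)
```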
I see that there are some learning rate schedulers here:
https://pytorch.org/docs/stable/optim.html
But none of them seems to combine the two phases described in the passage above (linear warmup followed by linear decay), or to start and stop at specific learning rates.
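The closest I've gotten is stitching the two phases together myself with torch.optim.lr_scheduler.LambdaLR. Here is a rough sketch of what I have in mind (the warmup/total step counts are taken from the quoted passage, and the model/optimizer are just placeholders), but I'm not sure it matches what fairseq actually does:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# Placeholder model/optimizer; in practice this would be the AdamW
# instance above with the peak lr of 1e-4.
model = torch.nn.Linear(768, 768)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

warmup_steps = 10_000     # warmup length from the quoted passage
total_steps = 1_000_000   # S = 1,000,000 updates

def lr_lambda(step):
    # LambdaLR expects a multiplier on the base lr (the 1e-4 peak).
    if step < warmup_steps:
        # Linear warmup: 0 -> peak over the first warmup_steps updates.
        return step / max(1, warmup_steps)
    # Linear decay: peak -> 0 over the remaining updates.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)

# Training loop sketch: step the scheduler once per optimizer update.
# for step in range(total_steps):
#     loss = ...
#     loss.backward()
#     optimizer.step()
#     scheduler.step()
#     optimizer.zero_grad()
```

With this the warmup starts from 0 rather than a small value like 1e-6, and I'm not sure whether that difference matters or whether this is how it's done in the original pretraining.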
Is there another way to reproduce the RoBERTa learning rate schedule?
I’m looking at the fairseq repo, but I’m having a hard time following the code, and I’m not sure how to copy the relevant parts into the trainer code I am working on.