How to create a scheduler which increases and decreases based on a certain LR like in Bert/Roberta

I’m trying to recreate the learning rate schedules used in BERT/RoBERTa, which start from an optimizer with specific arguments, increase linearly to a peak learning rate, and then decay at a specified rate.

Say that I am trying to reproduce the RoBERTa pretraining setup, described below:

BERT is optimized with Adam (Kingma and Ba, 2015) using the following parameters: β1 = 0.9, β2 = 0.999, ε = 1e-6 and L2 weight decay of 0.01. The learning rate is warmed up over the first 10,000 steps to a peak value of 1e-4, and then linearly decayed. BERT trains with a dropout of 0.1 on all layers and attention weights, and a GELU activation function (Hendrycks and Gimpel, 2016). Models are pretrained for S = 1,000,000 updates, with minibatches containing B = 256 sequences of maximum length T = 512 tokens.

I would need to start with a learning rate of 1e-6, warm up to 1e-4 in 1000 steps, then let the weight decay continue for the rest of the training.

I see that there are some learning rate schedulers here, but they don’t seem to have the two phases described in the passage above, or to start/stop at specific learning rates.

Is there another way to reproduce the RoBERTa learning rate schedule?

I’m looking at the fairseq repo, but I’m having a hard time following the code, and I’m not sure how to copy specific parts into the trainer code I am working on.

I have looked into how fairseq does the warmup.

The code is hard to follow, but it looks like the warmup is handled by PolynomialDecaySchedule (since there is a ‘@register_lr_scheduler(“polynomial_decay”)’, and ‘polynomial_decay’ is an arg for the pretraining command), which is given an optimizer object and specific args.

I can deduce most of the args from the ‘Train Roberta Base’ command here, except for power, which I am guessing is left at its default of 1.

I am also wondering how the decay in ‘and then linearly decayed’, as mentioned in the RoBERTa paper, is implemented. I see that there is already a decay happening with AdamW, since its weight decay is set to 0.01, but I don’t think that’s a linear decrease, and the paper does mention that there is a linear decrease.
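To check my own understanding of the difference, here is a minimal sketch of what I believe the weight decay term does, assuming decoupled weight decay as in AdamW and leaving out the gradient part of the update entirely. The function name is hypothetical, made up for illustration. If this is right, weight decay shrinks the model weights and never touches the learning rate, so the linear decrease in the paper would have to come from a scheduler.

```python
def decoupled_weight_decay_step(weight: float, lr: float, weight_decay: float) -> float:
    """Apply only the weight-decay part of one AdamW-style step.

    Hypothetical helper for illustration; the gradient-based part of the
    update is deliberately omitted.
    """
    # The parameter is shrunk toward zero; the learning rate is only read, never changed.
    return weight * (1.0 - lr * weight_decay)


lr = 1e-4
w = 1.0
for _ in range(3):
    w = decoupled_weight_decay_step(w, lr, weight_decay=0.01)

# After three steps the weight is slightly smaller, while lr is still 1e-4.
```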

Also, the PolynomialDecaySchedule class mentions that it’s supposed to be for decaying the learning rate (“Decay the LR on a fixed schedule.”). So it sounds like the class is also handling this linear decay, based on the end_learning_rate arg.

It seems that no arg is passed in for that, so it uses the default of zero. I’m not so sure of this though. I feel like it may have been mentioned in the paper, and it’s a little tricky to see what exactly is being passed into PolynomialDecaySchedule, because it doesn’t seem like that class is instantiated directly in the code. It seems to be instantiated some other way that I am not able to figure out.

Maybe the end learning rate is zero after all, since that’s how it’s done in BERT, which RoBERTa is based on.

BERT also shows power = 1.
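To make my reading concrete, here is a small sketch of a generic warmup followed by polynomial decay. This is not fairseq’s or BERT’s actual code: the function name, the argument names, and the choice to warm up from 0 are my own assumptions. With power = 1 and end_lr = 0, the decay phase reduces to a straight line from the peak down to zero, which would match ‘linearly decayed’.

```python
def polynomial_decay_lr(step, warmup_steps, total_steps, peak_lr, end_lr=0.0, power=1.0):
    """Return the learning rate at `step` (hypothetical helper, not a library API)."""
    if step < warmup_steps:
        # Warmup: linear ramp from 0 to peak_lr (assumed starting point).
        return peak_lr * step / max(1, warmup_steps)
    if step >= total_steps:
        # Past the end of training: stay at the final learning rate.
        return end_lr
    # Fraction of the decay phase still remaining, going from 1.0 down to 0.0.
    remaining = 1.0 - (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return (peak_lr - end_lr) * remaining ** power + end_lr


# Small illustrative numbers: 10 warmup steps, 110 steps in total.
halfway = polynomial_decay_lr(60, warmup_steps=10, total_steps=110, peak_lr=1e-4)
```

Here `halfway` sits midway through the decay phase, so with power = 1 it should be half of the peak learning rate.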

I am attempting to make a custom scheduler which replicates the RoBERTa warmup. So far I have come up with this, based on Huggingface’s linear warmup scheduler:

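from torch.optim.lr_scheduler import LambdaLR

def get_linear_schedule_with_warmup_with_peak(optimizer, num_warmup_steps, num_training_steps, init_lr, peak_lr, last_epoch=-1):

    # LambdaLR multiplies the optimizer's initial lr by the value returned here.
    def lr_lambda(current_step: int):
        if current_step < num_warmup_steps:
            # Warmup: factor grows linearly from 0 up to peak_lr / init_lr.
            return (float(current_step) / float(max(1, num_warmup_steps))) * (peak_lr / init_lr)
        # Decay: factor falls linearly from 1.0 to 0.0 over the remaining steps.
        return max(
            0.0, float(num_training_steps - current_step) / float(max(1, num_training_steps - num_warmup_steps))
        )

    return LambdaLR(optimizer, lr_lambda, last_epoch)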

This is not quite exact, since the optimizer is AdamW, which has its own weight decay. For it to be exact, I think I need init_lr to be replaced by the current optimizer learning rate, but from the documentation on LambdaLR, lr_lambda seems to only take in an int.

So I need a way to adjust the optimizer’s learning rate based on its current learning rate.
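One alternative I am considering, as a sketch under my own assumptions rather than a confirmed solution: since LambdaLR multiplies the optimizer’s initial learning rate by whatever lr_lambda returns, the optimizer could be created with lr set to peak_lr, and lr_lambda could return a fraction of that peak. In that case lr_lambda would not need to know the current learning rate at all. The factory name below is hypothetical.

```python
def make_warmup_linear_decay_lambda(num_warmup_steps, num_training_steps, init_lr, peak_lr):
    """Build an lr_lambda returning a multiplier relative to peak_lr (hypothetical helper)."""
    start = init_lr / peak_lr  # multiplier at step 0, e.g. 0.01 for 1e-6 -> 1e-4

    def lr_lambda(current_step: int) -> float:
        if current_step < num_warmup_steps:
            # Warmup: multiplier rises linearly from init_lr / peak_lr to 1.0.
            progress = current_step / max(1, num_warmup_steps)
            return start + (1.0 - start) * progress
        # Decay: multiplier falls linearly from 1.0 to 0.0 by the last training step.
        return max(
            0.0, (num_training_steps - current_step) / max(1, num_training_steps - num_warmup_steps)
        )

    return lr_lambda


# Illustrative numbers only: 1000 warmup steps out of 10000 total.
lr_lambda = make_warmup_linear_decay_lambda(1000, 10000, init_lr=1e-6, peak_lr=1e-4)
```

The idea would be to construct the optimizer with lr=peak_lr and then pass this function as the lr_lambda argument of torch.optim.lr_scheduler.LambdaLR, so the effective learning rate at each step is peak_lr times the returned multiplier.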
