How to create a scheduler that warms up to a peak LR and then decays, like in BERT/RoBERTa

I’m trying to recreate the learning-rate schedule used in BERT/RoBERTa: starting from a particular optimizer with specific args, the learning rate increases linearly to a peak value and then decays at a specified rate.

Say that I am trying to reproduce the RoBERTa pretraining, described below:

BERT is optimized with Adam (Kingma and Ba, 2015) using the following parameters: β1 = 0.9, β2 = 0.999, ε = 1e-6 and L2 weight decay of 0.01. The learning rate is warmed up over the first 10,000 steps to a peak value of 1e-4, and then linearly decayed. BERT trains with a dropout of 0.1 on all layers and attention weights, and a GELU activation function (Hendrycks and Gimpel, 2016). Models are pretrained for S = 1,000,000 updates, with minibatches containing B = 256 sequences of maximum length T = 512 tokens.

I would need to start with a learning rate of 1e-6, warm up to 1e-4 over the first 10,000 steps, and then let the learning rate decay for the rest of the training.
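
For reference, here is a minimal sketch of the optimizer configured with those hyperparameters (the paper says Adam with L2 weight decay; I’m assuming torch.optim.AdamW as the stand-in, and the model here is just a placeholder):

import torch

model = torch.nn.Linear(768, 768)  # placeholder model, just for illustration

# Hyperparameters from the passage above. lr is set to the peak value,
# since PyTorch schedulers generally scale the optimizer's base lr.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    eps=1e-6,
    weight_decay=0.01,
)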

I see that there are some learning rate schedulers here:

https://pytorch.org/docs/stable/optim.html

But none of them seem to implement the two phases described in the passage above, or to start/stop at specific learning rates.
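
One option I’m considering, assuming a recent enough PyTorch that has LinearLR and SequentialLR: chain a warmup scheduler and a decay scheduler, with the base lr set to the 1e-4 peak (so a start factor of 1e-2 gives 1e-6):

from torch.optim.lr_scheduler import LinearLR, SequentialLR

# Phase 1: linear warmup, 1e-6 -> 1e-4 over the first 10,000 steps
# (factors are multiples of the optimizer's base lr of 1e-4).
warmup = LinearLR(optimizer, start_factor=1e-2, end_factor=1.0, total_iters=10_000)

# Phase 2: linear decay, 1e-4 -> 0 over the remaining 990,000 updates.
decay = LinearLR(optimizer, start_factor=1.0, end_factor=0.0, total_iters=990_000)

# Switch from warmup to decay at step 10,000; call scheduler.step() once per update.
scheduler = SequentialLR(optimizer, schedulers=[warmup, decay], milestones=[10_000])

But I’d still like to understand how the original implementation does it.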

Is there another way to reproduce the RoBERTa learning-rate schedule?

I’m looking at the fairseq repo, but I’m having a tricky time following the code, and I’m not sure how to copy the relevant parts into the trainer code I am working on.

I have looked into how Fairseq does the warmup: https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md

I am having a tricky time following the code, but it looks like the warmup is handled by PolynomialDecaySchedule (since there is a @register_lr_scheduler("polynomial_decay") decorator, and polynomial_decay is an arg for the pretraining command), by passing in an optimizer object and specific args.

I can deduce most of the args from the ‘Train RoBERTa base’ command here:
https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md#2-train-roberta-base

The exception is power, which I am guessing is just left at its default of 1.

I am also wondering about the decay, the “and then linearly decayed” mentioned in the RoBERTa paper. There is already some decay happening with AdamW, since its weight decay is set to 0.01, but weight decay shrinks the weights rather than the learning rate, and it isn’t a linear decrease, while the paper explicitly describes a linear one.
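
To convince myself these are two different mechanisms, here is a rough sketch of a single decoupled-weight-decay update (the Adam moment estimates are omitted, and the names are mine):

# Rough sketch of one AdamW-style update for a single parameter p.
# weight_decay multiplies the *parameter*, not the learning rate, so the
# lr schedule has to be applied separately.
def adamw_like_step(p, grad_update, lr, weight_decay=0.01):
    p = p - lr * weight_decay * p   # decoupled weight decay: shrinks the weights
    p = p - lr * grad_update        # gradient step (Adam moments omitted)
    return p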

Also, the PolynomialDecaySchedule class’s docstring says it’s meant for decaying the learning rate (“Decay the LR on a fixed schedule.”), so it sounds like this class also handles the linear decay, driven by its end_learning_rate arg.

It seems that no arg is passed in for that, so it uses the default of zero. I’m not so sure of this, though; I feel like it may have been mentioned in the paper, and it’s a little tricky to see what exactly is being passed into PolynomialDecaySchedule, because that class doesn’t seem to be instantiated directly anywhere in the code. Presumably it gets built through the @register_lr_scheduler registry mentioned above, in some way I haven’t been able to figure out.

Maybe the end learning rate is zero after all, since that’s how it’s done in BERT, which RoBERTa is based on, and which also uses power = 1.
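
Putting that reading together: with power = 1 and end_learning_rate = 0, the polynomial schedule reduces to plain linear decay after warmup. Here is a sketch of the schedule as I understand it from the fairseq code (the function and argument names are mine, not fairseq’s):

def polynomial_decay_lr(step, peak_lr, end_lr, warmup_steps, total_steps, power=1.0):
    if warmup_steps > 0 and step <= warmup_steps:
        # Linear warmup from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    if step >= total_steps:
        return end_lr
    # Polynomial decay from peak_lr down to end_lr; power=1 makes it linear.
    pct_remaining = 1.0 - (step - warmup_steps) / (total_steps - warmup_steps)
    return (peak_lr - end_lr) * pct_remaining ** power + end_lr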

I am attempting to make a custom scheduler that replicates the RoBERTa warmup. So far I have come up with this, based on Hugging Face’s linear warmup scheduler:

from torch.optim.lr_scheduler import LambdaLR

def get_linear_schedule_with_warmup_with_peak(optimizer, num_warmup_steps, num_training_steps, init_lr, peak_lr, last_epoch=-1):
    # LambdaLR multiplies the optimizer's base lr (init_lr here) by the factor
    # returned from lr_lambda, so everything is expressed relative to init_lr.

    def lr_lambda(current_step: int):
        if current_step < num_warmup_steps:
            # Linear warmup from init_lr up to peak_lr.
            progress = float(current_step) / float(max(1, num_warmup_steps))
            return 1.0 + progress * (peak_lr / init_lr - 1.0)
        # Linear decay from peak_lr down to zero; without the peak_lr / init_lr
        # factor the lr would jump back down to init_lr right after warmup.
        remaining = float(num_training_steps - current_step) / float(max(1, num_training_steps - num_warmup_steps))
        return max(0.0, remaining) * (peak_lr / init_lr)

    return LambdaLR(optimizer, lr_lambda, last_epoch)

This is still not quite exact, since the optimizer is AdamW, which applies its own weight decay on top, and since LambdaLR always scales the optimizer’s base learning rate (init_lr here). For it to be exact, I’d need init_lr replaced by the optimizer’s current learning rate, but from the documentation on LambdaLR, lr_lambda seems to only take in an int (the step count).

So I need a way to adjust the optimizer’s learning rate based on its current learning rate.
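
One workaround I’m considering, since lr_lambda can only scale the base lr: skip LambdaLR entirely, compute the absolute learning rate myself each step, and write it into the param groups (using the hypothetical polynomial_decay_lr sketched above):

def set_lr(optimizer, lr):
    # Overwrite the lr in every param group with an absolute value.
    for group in optimizer.param_groups:
        group["lr"] = lr

# In the training loop, once per update:
# lr = polynomial_decay_lr(step, peak_lr=1e-4, end_lr=0.0,
#                          warmup_steps=10_000, total_steps=1_000_000)
# set_lr(optimizer, lr)

Does this seem like a reasonable way to replicate the schedule, or is there a cleaner approach?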
