Learning rate scheduler

grid_world · April 17, 2021, 7:07am

Hey guys, I am trying to recreate results of “The Lottery Ticket Hypothesis” by Frankle et al. They are using CIFAR-10 which has 50K training images. Using a batch size = 64 gives 781 iterations/steps in one epoch.

For VGG-18 & ResNet-18, the authors propose the following learning rate schedule

Linear learning rate warmup for first k = 7813 steps from 0.0 to 0.1

After 10 epochs or 7813 training steps, the learning rate schedule is as follows-

For the next 21094 training steps (or, 27 epochs), use a learning rate of 0.1
For the next 13282 training steps (or, 17 epochs), use a learning rate of 0.01
For any remaining training steps, use a learning rate of 0.001

I have implemented this in TensorFlow2 as follows:



from typing import Callable, List, Optional, Union

class WarmUp(tf.keras.optimizers.schedules.LearningRateSchedule):
    """
    Applies a warmup schedule on a given learning rate decay schedule.

    Args:
        initial_learning_rate (:obj:`float`):
            The initial learning rate for the schedule after the warmup (so this will be the learning rate at the end
            of the warmup).
            
        decay_schedule_fn (:obj:`Callable`):
            The schedule function to apply after the warmup for the rest of training.
        
        warmup_steps (:obj:`int`):
            The number of steps for the warmup part of training.
        
        power (:obj:`float`, `optional`, defaults to 1):
            The power to use for the polynomial warmup (defaults is a linear warmup).
        
        name (:obj:`str`, `optional`):
            Optional name prefix for the returned tensors during the schedule.
    """

    def __init__(
        self,
        initial_learning_rate: float,
        decay_schedule_fn: Callable,
        warmup_steps: int,
        power: float = 1.0,
        name: str = None,
    ):
        super().__init__()
        self.initial_learning_rate = initial_learning_rate
        self.warmup_steps = warmup_steps
        self.power = power
        self.decay_schedule_fn = decay_schedule_fn
        self.name = name

    def __call__(self, step):
        with tf.name_scope(self.name or "WarmUp") as name:
            # Implements polynomial warmup. i.e., if global_step < warmup_steps, the
            # learning rate will be `global_step/num_warmup_steps * init_lr`.
            global_step_float = tf.cast(step, tf.float32)
            warmup_steps_float = tf.cast(self.warmup_steps, tf.float32)
            warmup_percent_done = global_step_float / warmup_steps_float
            warmup_learning_rate = self.initial_learning_rate * tf.math.pow(warmup_percent_done, self.power)
            return tf.cond(
                global_step_float < warmup_steps_float,
                lambda: warmup_learning_rate,
                lambda: self.decay_schedule_fn(step - self.warmup_steps),
                name=name,
            )

    def get_config(self):
        return {
            "initial_learning_rate": self.initial_learning_rate,
            "decay_schedule_fn": self.decay_schedule_fn,
            "warmup_steps": self.warmup_steps,
            "power": self.power,
            "name": self.name,
        }


boundaries = [21093, 34376]
values = [0.1, 0.01, 0.001]

learning_rate_fn = tf.keras.optimizers.schedules.PiecewiseConstantDecay(boundaries, values)

warmup_shcedule = WarmUp(initial_learning_rate = 0.1, decay_schedule_fn = learning_rate_fn, warmup_steps = 7813)

optimizer = tf.keras.optimizers.SGD(learning_rate = warmup_shcedule, momentum = 0.9, decay = 0.0, nesterov = False)

I then train model using “tf.GradientTape” and view the learning rate as follows:

optimizer._decayed_lr('float32').numpy()

The resulting learning during the 60 epochs of training can be viewed:

You can access the complete code here. Since I am new to PyTorch, can you show me how I can achieve the same learning rate scheduler in torch?

Thanks