Using both learning rate warm up and a learning rate scheduler

I’m trying to implement both learning rate warmup and a learning rate schedule within my training loop.

I’m currently using pytorch_warmup’s LinearWarmup() for the warmup, which simply ramps the LR from 0 up to max_lr over a given number of steps.

I also want to use CosineAnnealingWarmRestarts(optimizer, T_0, T_mult) as my LR scheduler.

The challenge is that I want a rather long warmup period without using an initially high value of T_0. Is there a way to make the LR scheduler become active only after X number of steps have passed? A simplified version of my code is below. It may also be beneficial to hold the LR constant for a number of steps before the scheduled decay begins.

If my initial value of T_0 is not high enough, the LR during the warmup period follows the pattern of the LR schedule, and I’m not sure that is desirable.

import torch
import pytorch_warmup as warmup

optimizer = torch.optim.AdamW(params, lr=lr)
num_steps = len(dataloader) * num_epochs
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)
warmup_scheduler = warmup.LinearWarmup(optimizer, warmup_period=1000)
iters = len(dataloader)
for epoch in range(1, num_epochs + 1):
    for idx, batch in enumerate(dataloader):
        optimizer.zero_grad()
        loss = ...  # forward pass and loss computation
        loss.backward()
        optimizer.step()
        with warmup_scheduler.dampening():
            lr_scheduler.step(epoch + idx / iters)

Yes, you could use SequentialLR, which accepts multiple schedulers and uses milestones to activate them one after another.
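As a minimal sketch of that idea (the toy model, 1000-step warmup, and 500-step constant-LR plateau are illustrative placeholders, not values from your setup): LinearLR handles the ramp, ConstantLR holds the LR flat, and CosineAnnealingWarmRestarts takes over after the second milestone. Note that LinearLR’s start_factor must be greater than 0, so the ramp starts near zero rather than exactly at zero.

```python
import torch
from torch.optim.lr_scheduler import (
    SequentialLR, LinearLR, ConstantLR, CosineAnnealingWarmRestarts,
)

# Toy model so the snippet runs on its own; substitute your real model.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

warmup_steps = 1000  # length of the linear ramp (placeholder value)
hold_steps = 500     # optional constant-LR plateau before the decay

schedulers = [
    # Ramp from 1% of the base LR up to the full base LR.
    LinearLR(optimizer, start_factor=0.01, end_factor=1.0, total_iters=warmup_steps),
    # Hold the LR at the base value for a while.
    ConstantLR(optimizer, factor=1.0, total_iters=hold_steps),
    # Then begin the cosine-with-restarts decay.
    CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2),
]
scheduler = SequentialLR(
    optimizer,
    schedulers=schedulers,
    milestones=[warmup_steps, warmup_steps + hold_steps],
)

lr_history = []
for step in range(2000):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()  # one call per batch; SequentialLR picks the active phase
    lr_history.append(optimizer.param_groups[0]["lr"])
```

One caveat: SequentialLR advances its schedulers one step per call, so CosineAnnealingWarmRestarts is stepped in whole increments here rather than with the fractional `epoch + idx / iters` argument; T_0 is then measured in batches, not epochs.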

I’m the maintainer of pytorch_warmup.

You can delay the LR schedule for your need by stepping the scheduler only once the warmup period has passed:

    with warmup_scheduler.dampening():
        if warmup_scheduler.last_step + 1 >= warmup_period:
            lr_scheduler.step()

I put example code for CosineAnnealingWarmRestarts on a GitHub Gist:

Linear Warmup + Cosine Annealing with Warm Restarts

Example Code with T_mult=2:

Linear Warmup + Cosine Annealing with Warm Restarts 2