Discontinuity in learning rate value when resuming training from checkpoint

Hey, I’m trying to resume training from a given checkpoint using PyTorch’s CosineAnnealingLR scheduler.
Let’s say I want to train a model for 100 epochs but, for some reason, I had to stop training after epoch 45, having saved both the optimizer state and the scheduler state.
I want to resume training from epoch 46. I’ve followed what has previously been discussed on this forum about resuming training from a given epoch, but when I plot the learning rate values as a function of epoch, I get a discontinuity at epoch 46 (see figure below, plot on the left).

For comparison, I ran the full 100 epochs and plotted the learning rate to show what the expected curve should look like (see figure below, plot in the center).

We can see the two plots do not match when displayed on the same figure (see figure below, plot on the right; in green: the expected plot; in blue: the plot with the discontinuity).

Here is a snippet of the code I’ve used to resume training:

import torch

# model is the network being trained (defined elsewhere)
initial_epoch = 0
nepochs_first = 45
nepochs_total = 100

base_lr = 0.0001
optimizer_first = torch.optim.Adam(model.parameters(), lr=base_lr)

scheduler_first = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer_first, T_max=nepochs_total, last_epoch=initial_epoch-1)

lr_first = []
for i in range(initial_epoch+1, nepochs_first+1):
    scheduler_first.step()
    lr_first.append(scheduler_first.get_last_lr()[-1])

optimizer_state, scheduler_state = optimizer_first.state_dict(), scheduler_first.state_dict()

optimizer = torch.optim.Adam(model.parameters(), lr=1)
# I deliberately set the initial lr to a different value than base_lr, and it should be overwritten when loading the state_dict
optimizer.load_state_dict(optimizer_state)

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=nepochs_total, last_epoch=nepochs_first-1)
scheduler.load_state_dict(scheduler_state)

prev_epoch = scheduler_state['last_epoch']
lr = []
for i in range(prev_epoch+1, nepochs_total+1):
    scheduler.step()
    lr.append(scheduler.get_last_lr()[-1])

I’ve tried a bunch of things, but I couldn’t manage to get rid of this discontinuity.
Thank you for your help!

The discontinuity affects all LR values computed from epoch 46 to epoch 100.
Indeed, see what this quick snippet gave:

for i, val in enumerate(lr_first+lr):
    if val != lr_expected[i]:
        print(f'epoch: {i+1} \t actual lr: {val} \t expected lr: {lr_expected[i]}')
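(For reference, lr_expected above holds the learning rates from the uninterrupted 100-epoch run mentioned earlier; roughly, it was computed like this, reusing model, base_lr and nepochs_total from the first snippet:)

# reference run: one scheduler stepped over all 100 epochs, no checkpoint/resume involved
optimizer_ref = torch.optim.Adam(model.parameters(), lr=base_lr)
scheduler_ref = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer_ref, T_max=nepochs_total)

lr_expected = []
for i in range(1, nepochs_total+1):
    scheduler_ref.step()
    lr_expected.append(scheduler_ref.get_last_lr()[-1])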

Any ideas what I did wrong?

Have you tried saving the learning rate from the previous model and putting that in when you set up the optimizer instead of relying on the state dict?

optimizer = torch.optim.Adam(model.parameters(), lr=previously_saved_lr)

Hi, thanks for the suggestion, but the learning rate scheduler raises an error if last_epoch != -1 and the optimizer’s state doesn’t have an ‘initial_lr’ key.
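For example, this is roughly what happens with a fresh optimizer (minimal sketch; the variable names and the value of previously_saved_lr are just placeholders for illustration, and the exact error text may vary between PyTorch versions):

previously_saved_lr = 0.00005  # hypothetical placeholder for the lr saved at epoch 45
fresh_optimizer = torch.optim.Adam(model.parameters(), lr=previously_saved_lr)

# raises (roughly): KeyError: "param 'initial_lr' is not specified in param_groups[0]
# when resuming an optimizer", because last_epoch != -1 and the fresh optimizer's
# param_groups have no 'initial_lr' entry
scheduler_err = torch.optim.lr_scheduler.CosineAnnealingLR(fresh_optimizer, T_max=100, last_epoch=44)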

The issue I’m facing may come from PyTorch’s _LRScheduler class: its __init__ method ends with a self.step() call that may change the learning rate value…
I suspect this because, as soon as I instantiate the LR scheduler, the optimizer’s learning rate is modified:

# let's say we have stopped training after epoch 45 out of 100 
# and we want to resume training
epoch_restart = 45

new_optimizer = torch.optim.Adam(model.parameters(), lr=1)
# load optimizer state saved when training stopped 
new_optimizer.load_state_dict(optimizer_state)

# retrieve optimizer's lr value before instantiating LR scheduler
lr_first = new_optimizer.state_dict()['param_groups'][0]['lr']

new_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    new_optimizer,
    T_max=100,
    eta_min=0,
    last_epoch=epoch_restart,
)

# retrieve optimizer's lr value after instantiating LR scheduler
lr_then = new_optimizer.state_dict()['param_groups'][0]['lr']

If you compare lr_first and lr_then, they’ll be different!
Why?
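If that __init__ step() call really is the culprit, I guess the order of operations matters: construct the scheduler first (leaving last_epoch at its default of -1) and only then load the two state dicts, so that optimizer.load_state_dict overwrites whatever the scheduler’s __init__ wrote into the learning rate. Untested sketch of what I mean:

# untested sketch: build the scheduler BEFORE loading the checkpointed states
optimizer = torch.optim.Adam(model.parameters(), lr=1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)  # last_epoch stays at -1

optimizer.load_state_dict(optimizer_state)  # restores the lr (and initial_lr) saved at epoch 45
scheduler.load_state_dict(scheduler_state)  # restores last_epoch, base_lrs, etc.

lr = []
for i in range(scheduler.last_epoch+1, 100+1):
    scheduler.step()
    lr.append(scheduler.get_last_lr()[-1])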