Checkpointing with lr_scheduler

Hi, I’m using checkpointing to resume training of a model, but using torch.optim.lr_scheduler.OneCycleLR alongside it is giving me problems. I initially trained the model for 10 epochs, saved the best checkpoint, and then set max_epochs to 20 to continue training for 10 more epochs, but I get the following error from the lr_scheduler. It seems that when I load the checkpoint, the scheduler is still stuck on the original limit on the number of steps, set by the steps_per_epoch argument of OneCycleLR.

ValueError: Tried to step 101 times. The specified number of total steps is 100
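Here is a minimal reproduction outside any training framework (the tiny model and the 10×10 step counts are just placeholders for my real setup):

```python
import torch

# A scheduler sized for 10 epochs x 10 steps refuses the 101st step.
model = torch.nn.Linear(2, 2)
optimizer = torch.optim.AdamW(model.parameters(), weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=3e-2, epochs=10, steps_per_epoch=10)  # total_steps = 100

for _ in range(100):          # the first 10 "epochs" work fine
    optimizer.step()
    scheduler.step()

err = None
try:
    scheduler.step()          # step 101: exceeds total_steps
except ValueError as e:
    err = e                   # "...The specified number of total steps is 100"
print(err)
```

Resuming training past the checkpoint hits exactly this: the restored scheduler still believes the run is only total_steps long.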

I have tried adjusting the steps_per_epoch argument of OneCycleLR, as well as setting total_steps=self.trainer.estimated_stepping_batches, but neither helps.

Can you show how you configure the optimiser? Did you try with verbose=True to get more information?


Sure, here goes:

optimizer = torch.optim.AdamW(
    self.parameters(), weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=3e-2,
    epochs=10, steps_per_epoch=len(train_dataloader))

Using verbose=True in the scheduler gives me a log of the learning rate at each step and a stack trace on the error, but I don’t know what to do with it.

I was hoping to find some standard practice for updating the scheduler’s step count after loading the checkpoint, but couldn’t find one. Another hack I considered was to skip the scheduler entirely when resuming from a checkpoint, which does work, but it all seems like too much. I hope there’s a better way.
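For what it’s worth, the least hacky workaround I’ve sketched so far is to re-create the scheduler sized for the full 20-epoch run and fast-forward it past the steps already taken, rather than poking at its internal state (the tiny model and steps_per_epoch=10 below are placeholders for my real setup):

```python
import torch

# Sketch of resuming: rebuild the scheduler over the FULL 20-epoch run,
# then replay the steps already taken so the schedule lines up.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), weight_decay=1e-2)
# In real code: optimizer.load_state_dict(checkpoint["optimizer"])

steps_per_epoch = 10                 # len(train_dataloader) in practice
total_epochs = 20                    # the full run, not just the first 10
steps_done = 10 * steps_per_epoch    # steps taken before the checkpoint

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=3e-2,
    epochs=total_epochs, steps_per_epoch=steps_per_epoch)

# Fast-forward past the completed steps (prints a one-time warning about
# calling scheduler.step() before optimizer.step(); harmless here).
for _ in range(steps_done):
    scheduler.step()

# The remaining 10 epochs now step without the ValueError.
for _ in range(steps_done, total_epochs * steps_per_epoch):
    optimizer.step()
    scheduler.step()
```

This keeps the one-cycle shape consistent across the whole 20 epochs instead of restarting the cycle, but I’d still prefer a built-in way to do it.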