Hi, I’m using checkpointing to resume training of a model, together with
torch.optim.lr_scheduler.OneCycleLR, which is giving me some problems. I initially trained the model for 10 epochs and saved the best checkpoint, then set
max_epochs to 20 to continue training for 10 more epochs, but I get the following error from the lr_scheduler. It seems that when I load the checkpoint, the scheduler is still stuck on the original limit on the number of steps, set by the steps_per_epoch argument of OneCycleLR.
ValueError: Tried to step 101 times. The specified number of total steps is 100
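For context, the error is easy to reproduce outside Lightning: OneCycleLR raises as soon as it is stepped more times than its total_steps budget allows. The model and hyperparameters below are placeholders, just to drive the scheduler:

```python
import torch

# Hypothetical tiny model/optimizer, purely to exercise the scheduler.
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, total_steps=100
)

err = None
for _ in range(102):  # a couple of steps beyond the budget
    try:
        optimizer.step()
        scheduler.step()
    except ValueError as e:
        err = e  # "Tried to step ... total steps is 100"
        break

print(err)
```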
I have tried setting the steps_per_epoch argument of OneCycleLR, as well as
total_steps=self.trainer.estimated_stepping_batches, but nothing helps.
Can you show how you configure the optimiser? Did you try verbose=True to get more information?
Sure, here goes:
# (arguments are placeholders; the original snippet was truncated)
optimizer = torch.optim.AdamW(self.parameters(), lr=self.lr)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=self.lr, total_steps=self.trainer.estimated_stepping_batches)
verbose=True on the scheduler gives me a log of the learning rate at each step and a stack trace on the error, but I don’t know what to do with it.
I was hoping to find some standard practice for updating the scheduler’s step count after loading the checkpoint, but couldn’t find one. Another hack I considered was to skip the scheduler entirely when resuming from a checkpoint, which does work, but it all seems too much. I hope there’s a better way.
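One workaround along those lines, sketched outside Lightning (model, learning rates, and step counts below are illustrative): OneCycleLR’s state_dict() includes total_steps, so load_state_dict() on resume overwrites whatever new value you pass to the constructor. Rebuilding the scheduler for the new run length and then restoring total_steps after loading the checkpointed state keeps the step counter but lifts the stale limit:

```python
import torch

model = torch.nn.Linear(2, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# First run: budget of 100 steps; its state ends up in the checkpoint.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, total_steps=100
)
ckpt_state = scheduler.state_dict()  # contains total_steps=100

# Resumed run: new budget of 200 steps.
resumed = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, total_steps=200
)
resumed.load_state_dict(ckpt_state)  # restores step count, but also the old total_steps
resumed.total_steps = 200            # patch the stale limit back to the new budget
```

In Lightning, the same patch could be applied after the trainer restores the scheduler, e.g. in a hook such as on_load_checkpoint. Note that it changes the shape of the remaining LR curve, since the one-cycle schedule is now stretched over the new total.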