~/anaconda3/envs/pytorch4/lib/python3.6/site-packages/torch/optim/lr_scheduler.py in __init__(self, optimizer, last_epoch)
     18                 if 'initial_lr' not in group:
     19                     raise KeyError("param 'initial_lr' is not specified "
---> 20                                    "in param_groups[{}] when resuming an optimizer".format(i))
     21         self.base_lrs = list(map(lambda group: group['initial_lr'], optimizer.param_groups))
     22         self.step(last_epoch + 1)

KeyError: "param 'initial_lr' is not specified in param_groups[0] when resuming an optimizer"
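For context, a minimal repro sketch (not from the original poster; the model and optimizer below are placeholders) that triggers this KeyError by passing a non-default last_epoch to a scheduler built on a fresh optimizer:

from torch import nn, optim
from torch.optim import lr_scheduler

model = nn.Linear(10, 2)                            # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.1)   # fresh optimizer

# A fresh optimizer's param_groups contain no 'initial_lr' key, so any
# last_epoch other than -1 raises the KeyError shown above.
scheduler = lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1, last_epoch=99)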
You are trying to initialize a new optimizer and initialize the scheduler to another last_epoch.
As the optimizer wasn't used by a scheduler from the beginning, the 'initial_lr' key is missing from its param_groups.
What is your exact use case?
Would you like to use the scheduler as if it had already been run for 100 epochs?
If so, you could set last_epoch=-1 in the instantiation and call the scheduler 100 times in a dummy for loop.
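A minimal sketch of that suggestion (the model, optimizer, and scheduler here are placeholders, not taken from the thread):

from torch import nn, optim
from torch.optim import lr_scheduler

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Build the scheduler with the default last_epoch=-1, then fast-forward it by
# stepping once per epoch that has already been completed.
scheduler = lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1, last_epoch=-1)
for _ in range(100):
    scheduler.step()

(Recent PyTorch versions warn when scheduler.step() is called before optimizer.step(), but the fast-forward still works.)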
'last_epoch' is an argument exposed to users, which suggests we should be able to set it to any number, not just -1.
If we can't assign it any other value at initialization, isn't this argument redundant?
I would prefer a design that restores the epoch state automatically from the 'last_epoch' argument.
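For what it's worth, last_epoch != -1 does work if the 'initial_lr' key that the check quoted at the top of the thread looks for is supplied up front. A hedged sketch (placeholder model and optimizer again):

from torch import nn, optim
from torch.optim import lr_scheduler

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Record each group's starting LR so the scheduler accepts last_epoch != -1.
for group in optimizer.param_groups:
    group.setdefault('initial_lr', group['lr'])

# The scheduler can now be constructed as if 100 epochs had already run.
scheduler = lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1, last_epoch=99)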
Getting this error myself in https://github.com/ultralytics/yolov3. The band-aid 'solution' was to set the attribute after the scheduler is already defined. I'm not sure the scheduler is actually initialized to the correct LR, but the code runs without errors in the second example below:
Traceback (most recent call last):
  File "train.py", line 423, in <module>
    train()  # train normally
  File "train.py", line 152, in train
    scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lf, last_epoch=start_epoch - 1)
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py", line 189, in __init__
    super(LambdaLR, self).__init__(optimizer, last_epoch)
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py", line 41, in __init__
    "in param_groups[{}] when resuming an optimizer".format(i))
KeyError: "param 'initial_lr' is not specified in param_groups[0] when resuming an optimizer"
class LambdaLR(_LRScheduler):
    """Sets the learning rate of each parameter group to the initial lr
    times a given function. When last_epoch=-1, sets initial lr as lr.
    ......
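The second example referenced above is not reproduced in this excerpt. A hedged sketch of that kind of band-aid, reusing the optimizer, lf, and start_epoch variables from the train.py traceback:

# Construct with the default last_epoch=-1 so the 'initial_lr' check never
# fires, then overwrite the attribute afterwards. Whether the LR is exactly
# right on the next step is not verified here.
scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lf)
scheduler.last_epoch = start_epoch - 1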
The reason initial_lr is missing from last.pt is that the optimizer is not saved at the final epoch. I think the resume mechanism is designed for the case where training is interrupted, not for continuing after the last epoch.
If we want to continue training after the last epoch, we can change the line
'optimizer': None if final_epoch else optimizer.state_dict()
to
'optimizer': optimizer.state_dict()
in train.py (see the sketch after the original block below).
I'm not sure, but it seems to work for me.
# Save training results
save = (not opt.nosave) or (final_epoch and not opt.evolve)
if save:
    with open(results_file, 'r') as f:
        # Create checkpoint
        chkpt = {'epoch': epoch,
                 'best_fitness': best_fitness,
                 'training_results': f.read(),
                 'model': model.module.state_dict() if hasattr(model, 'module') else model.state_dict(),
                 'optimizer': None if final_epoch else optimizer.state_dict()}
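The change proposed above amounts to always keeping the optimizer state; the rest of the dict stays the same (sketch only, using the same variables as the block above):

# Modified: always save the optimizer state so training can be resumed
# even after the final epoch, at the cost of a larger checkpoint file.
chkpt = {'epoch': epoch,
         'best_fitness': best_fitness,
         'training_results': f.read(),
         'model': model.module.state_dict() if hasattr(model, 'module') else model.state_dict(),
         'optimizer': optimizer.state_dict()}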
Ah, yes, thanks for the feedback! You are correct, --resume is really only intended for accidentally stopped training, e.g. you train to 300 but your computer shuts down at 100. You can use the exact same training command you originally used, plus --resume, to finish the training to 300.
If you train to 300/300 and then decide you want to train to 400, you are out of luck, because the LR scheduler has already decayed the LR to near zero, and defining a new number of --epochs would create a nonlinearity in the LR schedule. In this case you should restart your training from the beginning with --epochs 400.
And to answer your last point, we actually remove the optimizer on purpose after complaints about file sizes: keeping the optimizer roughly doubles the size of the weights file, since the checkpoint then carries per-parameter optimizer state in addition to the parameters themselves.