A problem occurred when resuming an optimizer

Ah, thanks for the feedback! You're correct: --resume is really only intended for accidentally interrupted training, i.e. you set out to train to 300 epochs but your computer shuts down at 100. You can then rerun the exact same training command you originally used, plus --resume, to finish training to 300.

If you train to 300/300 and then decide you want to train to 400, you're out of luck: the LR scheduler has already decayed to near zero, and supplying a new --epochs value would create a discontinuity in the LR schedule at the resume point. In that case you should restart your training from the beginning with --epochs 400.
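To make the scheduler issue concrete, here is a minimal sketch using a generic cosine decay (illustrative only; the hyperparameter values and the exact schedule formula are assumptions, not necessarily what the training script uses). At epoch 300 of a 300-epoch run the LR has already reached its floor, but re-evaluating the same schedule with a 400-epoch horizon puts the LR much higher at that same epoch, which is the discontinuity described above:

```python
import math

def cosine_lr(epoch, total_epochs, lr0=0.01, lr_final=0.0002):
    """Cosine decay from lr0 down to lr_final over total_epochs (illustrative)."""
    frac = (1 + math.cos(math.pi * epoch / total_epochs)) / 2
    return lr_final + (lr0 - lr_final) * frac

# End of the original 300-epoch run: LR is at its floor.
lr_end = cosine_lr(300, 300)

# Resuming at epoch 300 with --epochs 400 re-evaluates the schedule
# with the new horizon, so the LR jumps back up at that epoch.
lr_resumed = cosine_lr(300, 400)

print(lr_end, lr_resumed)  # lr_resumed is several times larger than lr_end
```

This is why extending a finished run cannot simply continue the old schedule: the scheduler would have to jump from its floor back to a mid-schedule value.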

And to answer your last point, we actually strip the optimizer on purpose, after complaints about file sizes: keeping it roughly doubles the size of the weight file, since the file is then carrying the optimizer state (e.g. a momentum buffer for each parameter) in addition to the parameters themselves.
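A toy illustration of the size effect, using plain Python lists and pickle rather than real tensors (the checkpoint keys here are made up for the example, not the repo's actual format). With one extra optimizer value per parameter, the serialized file is roughly twice the size of the weights alone:

```python
import pickle

n = 10_000  # pretend parameter count

# Toy "checkpoint": model weights plus per-parameter optimizer state.
ckpt = {
    "model": [float(i) for i in range(n)],
    "optimizer": {"momentum_buffer": [float(i) for i in range(n)]},
}

full_size = len(pickle.dumps(ckpt))

# Stripping the optimizer state before shipping final weights
# roughly halves the serialized size.
ckpt["optimizer"] = None
stripped_size = len(pickle.dumps(ckpt))

print(full_size, stripped_size)  # stripped_size is about half of full_size
```

The same idea applies to real framework checkpoints: the optimizer state is only needed to resume training, so final "deploy" weights can drop it.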
