Save the state_dicts of the model and the Adam optimizer, and pause training. Then load the model's state_dict and switch the optimizer to SGD.
SGD does not keep extra per-parameter state (unless you use momentum), so you can simply create a new SGD optimizer without loading any optimizer state.
torch.save({'model': model.state_dict(), 'optim': optim.state_dict()}, '...')
To switch to SGD, use:
state_dict = torch.load('...')
model.load_state_dict(state_dict['model'])
optim = torch.optim.SGD(model.parameters(), lr=new_learning_rate)
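Putting the steps above together, here is a minimal end-to-end sketch. The model, checkpoint filename, and learning rates are illustrative assumptions, not part of the original post:

```python
# Sketch: checkpoint a model trained with Adam, then resume with SGD.
# The model architecture, file path, and learning rates are made up
# for illustration.
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

# ... train with Adam for a while, then checkpoint both state_dicts ...
torch.save({'model': model.state_dict(), 'optim': optim.state_dict()},
           'checkpoint.pt')

# Later: restore the weights and start fresh with SGD. Adam's moment
# buffers saved under 'optim' are simply never loaded; plain SGD
# (without momentum) keeps no per-parameter state, so nothing carries over.
state_dict = torch.load('checkpoint.pt')
model.load_state_dict(state_dict['model'])
optim = torch.optim.SGD(model.parameters(), lr=1e-2)
```

If you did want momentum SGD, its momentum buffers would start from zero, which usually just means a few warm-up steps after the switch.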
Thank you!!
Your idea is to use Adam for fast initial training and then switch to SGD near the end. That is a good idea, and there is a paper with the same motivation: "Adaptive Gradient Methods with Dynamic Bound of Learning Rate. In Proc. of ICLR 2019."
As described in the paper, AdaBound is an optimizer that behaves like Adam at the beginning of training and gradually transforms into SGD at the end.
Please refer to this code: GitHub - Luolc/AdaBound: An optimizer that trains as fast as Adam and as good as SGD.