Importance of Optimizers when continuing training

How important is it to use the same optimizer when continuing training?

  • If I train a model with an Adam optimizer wrapped in a OneCycle LR schedule for, let’s say, ~10 epochs.

  • Then I decide to fine-tune the model, so I make a new optimizer with different settings (e.g. learning rate) and train for, let’s say, another ~10 epochs.

Example:

import torchvision
import torch.optim as optim

# initial model, opt, and scheduler.
my_model = torchvision.models.resnet18()
my_optim = optim.Adam(my_model.parameters(), lr=1e-3)
# note: OneCycleLR also expects the schedule length (total_steps, or epochs + steps_per_epoch)
my_lr_scheduler = optim.lr_scheduler.OneCycleLR(optimizer=my_optim, max_lr=3e-2)

# Fit/train for 10 epochs
fit()

# Fine-tune model with a smaller lr.
my_optim = optim.Adam(my_model.parameters(), lr=1e-5)
my_lr_scheduler = optim.lr_scheduler.OneCycleLR(optimizer=my_optim, max_lr=3e-4)

# Fit/train for 10 more epochs
fit()

Will this perform worse compared to changing the original optimizer’s hyperparameters manually via my_optim.param_groups[0]['lr']?

I noticed that an optimizer’s state dict, my_optim.state_dict(), holds two keys: state and param_groups. Making a new optimizer, as in the example above, gives param_groups the right values, but the state is lost. The example below maintains the state.

Example:

# initial model, opt, and scheduler.
my_model = torchvision.models.resnet18()
my_optim = optim.Adam(my_model.parameters(), lr=1e-3)
my_lr_scheduler = optim.lr_scheduler.OneCycleLR(optimizer=my_optim, max_lr=3e-2)

# Fit/train for 10 epochs
fit()

# Fine-tune model with a smaller lr.
my_optim.param_groups[0]['lr'] = 1e-5
my_lr_scheduler = optim.lr_scheduler.OneCycleLR(optimizer=my_optim, max_lr=3e-4)

# Fit/train for 10 more epochs
fit()
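A third option I can think of (just a sketch, I haven’t benchmarked it) would be to create the new optimizer with the new hyperparameters and then copy the old optimizer’s state into it via state_dict() / load_state_dict(), so the running averages are kept while param_groups reflects the new settings:

# Create a new optimizer, then graft the old optimizer's state onto it.
old_sd = my_optim.state_dict()  # {'state': ..., 'param_groups': ...}
my_optim = optim.Adam(my_model.parameters(), lr=1e-5)
new_sd = my_optim.state_dict()
my_optim.load_state_dict({'state': old_sd['state'],                  # keep the running averages
                          'param_groups': new_sd['param_groups']})   # keep the new hyperparameters
my_lr_scheduler = optim.lr_scheduler.OneCycleLR(optimizer=my_optim, max_lr=3e-4)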

Are there any resources out there showing the importance of whether or not to reuse optimizers and their states?


Fun little experiment to follow up on my original question: Optimizer Benchmarks

Main conclusions from my project page/first blog:

  • OneCycle LR > Constant LR
  • Making a new optimizer vs. preserving state and re-using the same optimizer achieve very similar performance, i.e. discarding an optimizer’s state didn’t really hurt the model’s performance, with or without an LR scheduler. Maybe the state is relearned quickly.

Note: These conclusions are based on the Adam optimizer and the OneCycle LR scheduler. I haven’t experimented with other optimizers to see if dropping their state is more impactful.

Edit note: I’m not proposing to always throw optimizers away; I still believe the general guideline is to use the same optimizer and keep the history, as explained below. Kindly share resources if anyone has found results showing the importance of using the same optimizer.

The internal workings of some optimization algorithms (examples are Adam and its variants, and BFGS and its variants) have “memory-based” components that generally improve their performance over very naive direct gradient-descent procedures. Adam uses running averages over quantities computed in previous iterations, and L-BFGS stores information from previous gradients and steps in order to build an approximate Hessian (which improves the algorithms’ ability to “step” in the correct direction).
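As a concrete illustration of that “memory” in the Adam case: in PyTorch, as far as I understand, the running averages live in the optimizer’s per-parameter state, which you can inspect after a step. A quick sketch with a toy layer:

import torch
import torch.optim as optim

# Toy example: after one optimizer step, Adam's state holds its running averages.
layer = torch.nn.Linear(4, 2)
opt = optim.Adam(layer.parameters(), lr=1e-3)
layer(torch.randn(8, 4)).sum().backward()
opt.step()
p = next(layer.parameters())
print(opt.state[p].keys())  # step count plus the exp_avg / exp_avg_sq running averages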

When using these types of algorithms, it is detrimental to stop the optimization and then initialize a completely new optimizer (even one of the same algorithm), because the “memory” effects are thrown away.

I’m not an expert on saving the state of an optimizer, but with my moderate experience using L-BFGS I do believe that initializing a new optimizer using the state of the original optimizer will allow you to retain the “memory” from the old one. This should hopefully lead to very similar performance to the case in which you keep the old optimizer running.

These are just general guidelines for how the algorithms themselves work. I’m not sure exactly how they apply to the specific problem instance that you posted above.

I completely agree that it’s detrimental to create new optimizers and throw away history. I’m a bit surprised not to see much slowdown/performance hindrance when doing this. It’s definitely important to continue using the same optimizer, but I’m curious to get an idea of how important it is.

Often you would see a spike in your loss if you don’t restore the optimizer’s internal state (and we have some posts about this behavior here on the discussion board).

Even with this spike, your final loss might end up in the same value range as for a complete training run, but you might increase the training time.
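The usual way to avoid this is to checkpoint the optimizer (and scheduler) together with the model and restore all of them before continuing. Roughly like this (the file name is just a placeholder):

# Save everything needed to resume training.
torch.save({'model': my_model.state_dict(),
            'optim': my_optim.state_dict(),
            'sched': my_lr_scheduler.state_dict()}, 'checkpoint.pth')

# Later: restore before continuing training.
checkpoint = torch.load('checkpoint.pth')
my_model.load_state_dict(checkpoint['model'])
my_optim.load_state_dict(checkpoint['optim'])
my_lr_scheduler.load_state_dict(checkpoint['sched'])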

I see, makes sense. I’ll look around the forums further! Thanks for the tips as usual.

Hi, the optimizer of a previously saved model can’t change the learning rate for the next training phase; for example, when the loss is decreasing, I want to switch to a smaller learning rate. Is there a way of saving a model that accomplishes this? How can I save a model like a PyTorch pretrained model (for example MobileNetV2), so that I can train my model on this base, save it, and then use a new optimizer or a new learning rate, like we do with pretrained models?

The workflow would be the same. Once you’ve stored the state_dict of the pretrained model, you could create a new model instance and load the state_dict into it.
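Roughly like this, reusing the imports from the examples above (the path and the smaller learning rate are just placeholder values):

# First training phase on the base model.
model = torchvision.models.mobilenet_v2()
# ... train ...
torch.save(model.state_dict(), 'mobilenetv2_base.pth')

# Later: create a new model instance, load the stored weights,
# and attach a fresh optimizer with the smaller learning rate.
new_model = torchvision.models.mobilenet_v2()
new_model.load_state_dict(torch.load('mobilenetv2_base.pth'))
new_optim = optim.Adam(new_model.parameters(), lr=1e-5)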