Why save optimizer state dict?

Hi everyone :slight_smile:

Probably a simple question but I can’t seem to figure out why exactly we should save the optimizer.state_dict() to resume training? Why isn’t it sufficient to save the model.state_dict()? The optimizer.state_dict() does not have any learnable parameters if I understand it correctly…

Thank you for the clarification!

All the best,
snowe

Most optimizers collect per-parameter gradient statistics, so you’d save things like the gradient’s running mean and variance, and resumed training would make better initial steps.
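For instance, with Adam those statistics are visible directly in the state dict. A minimal sketch (the layer sizes here are arbitrary, just for illustration):

```python
import torch

# Toy model and optimizer; sizes are made up.
model = torch.nn.Linear(4, 2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# A few training steps so the optimizer accumulates statistics.
for _ in range(3):
    opt.zero_grad()
    loss = model(torch.randn(8, 4)).pow(2).mean()
    loss.backward()
    opt.step()

state = opt.state_dict()
# Adam keeps, per parameter tensor: the step count, the running mean
# of gradients (exp_avg), and of squared gradients (exp_avg_sq).
print(sorted(state["state"][0].keys()))  # ['exp_avg', 'exp_avg_sq', 'step']
```

Saving this dict alongside `model.state_dict()` lets a resumed run continue with warmed-up moment estimates instead of starting them from zero.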

Hi @googlebot, thank you for your response! :slight_smile:

But theoretically it would also be possible to resume training without the optimizer state, at the cost of some initial performance?

If, for example, I want to take the pretrained parameters and use them in another network with a different structure, I can do that for the model.state_dict() with the strict=False flag, but there is no such thing for the optimizer.state_dict().

There is a difference between using pretrained parameters and resuming training: if you “transfer” parameters, the gradient history no longer applies, so the optimizer should start from a clean state.

I don’t exactly understand what you mean…

From what I understand, if you want to use a pretrained model just for testing (no more learning), you can use the model.state_dict() to load the ‘best’ weights. If you want to continue training, you need both the model.state_dict() and the optimizer.state_dict().
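That resume pattern is often written as a single checkpoint holding both dicts. A minimal sketch (the in-memory buffer below just stands in for a file path):

```python
import io
import torch

# Toy model and optimizer; sizes are made up.
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# A couple of steps so SGD has momentum buffers worth saving.
for _ in range(2):
    opt.zero_grad()
    model(torch.randn(8, 4)).pow(2).mean().backward()
    opt.step()

# Save both state dicts in one checkpoint.
buf = io.BytesIO()  # stand-in for a file on disk
torch.save({"model": model.state_dict(), "optimizer": opt.state_dict()}, buf)

# Later: rebuild identical model/optimizer objects and restore both.
buf.seek(0)
ckpt = torch.load(buf)
model.load_state_dict(ckpt["model"])
opt.load_state_dict(ckpt["optimizer"])
```

With only the model dict restored, training still resumes, but the momentum buffers restart from zero.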

My question concerns the second use case, where I want to continue training but my model is slightly different. So I am not sure whether I still need the optimizer.state_dict() and, if so, how I can apply it to a new model.

No, you’d reload the optimizer’s state_dict if you want to pause/resume training at epoch N>0 for whatever reason. If the model or dataset changes, that should be considered a new run from epoch 0; you’re free to reload parameters via model.load_state_dict(strict=False) for it, but there is no need for the old optimizer’s state (it only contains stale auxiliary buffers).
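A minimal sketch of that “new run” case (the module names Old/New and the layer sizes are invented for illustration): transfer whatever keys match with strict=False, then build a fresh optimizer.

```python
import torch

class Old(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.body = torch.nn.Linear(4, 8)
        self.head = torch.nn.Linear(8, 2)

class New(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.body = torch.nn.Linear(4, 8)        # same key & shape: transfers
        self.classifier = torch.nn.Linear(8, 5)  # new key: stays randomly initialized

pretrained = Old().state_dict()
new_model = New()

# strict=False ignores keys that exist on one side only
# (note it still errors on matching keys with mismatched shapes).
result = new_model.load_state_dict(pretrained, strict=False)
print(result.missing_keys)     # classifier.* were not in the checkpoint
print(result.unexpected_keys)  # head.* have no home in the new model

# Fresh optimizer: the old gradient statistics don't apply here.
opt = torch.optim.Adam(new_model.parameters(), lr=1e-3)
```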

PS: even if you’re training the reloaded params further, the new run will have different gradient statistics, so reloading the optimizer is not a good idea.

Ah I see, that makes sense. Thank you!

But loading the pretrained parameters will still speed up the training, right, even if the network and the data are not exactly the same? They should be considerably better than random weights?

Sure, you’re just resetting the learning-rate “adaptations” from the old training.
