Loading a checkpoint to resume training without loading optimizer.state_dict?

Is loading the model_state_dict and then passing model.parameters() to a freshly created optimizer the same as loading the optimizer's state_dict?

Below is the example code:

    if opt.epoch != 0:
        # Load pretrained models
        generator.load_state_dict(torch.load("saved_models/%s/generator_%d.pth" % (opt.dataset_name, opt.epoch)))
        discriminator.load_state_dict(torch.load("saved_models/%s/discriminator_%d.pth" % (opt.dataset_name, opt.epoch)))
    else:
        # Initialize weights
        generator.apply(weights_init_normal)
        discriminator.apply(weights_init_normal)

    # Optimizers (created fresh here; their internal state is not restored from the checkpoint)
    optimizer_G = torch.optim.Adam(generator.parameters(), lr=opt.lr, betas=(opt.b1, opt.b2))
    optimizer_D = torch.optim.Adam(discriminator.parameters(), lr=opt.lr, betas=(opt.b1, opt.b2))

I think it depends on your optimizer type. For plain SGD (without momentum), there would be no difference. But for optimizers that track internal state, like Adam, you should load the optimizer state from the checkpoint as well.


Thanks for the reply. May I know why Adam needs to load the optimizer state while SGD does not? So, in my case, since I used Adam, the example code I gave will not resume training correctly?

May I know why Adam needs to load the optimizer state while SGD does not?

This is a result of the optimizers' internal design. Adam keeps two running statistics per parameter, m_t and v_t, to store historical gradient information, while plain SGD doesn't. That is also why the Adam optimizer takes more memory than SGD.
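
For example, here is a minimal sketch (using a hypothetical toy model, not the GAN from the question) that inspects both optimizers' state dicts after a single step:

    import torch

    model = torch.nn.Linear(4, 2)  # hypothetical toy model for illustration
    model(torch.randn(8, 4)).sum().backward()

    adam = torch.optim.Adam(model.parameters())
    adam.step()
    # Adam stores per-parameter running moments: 'exp_avg' (m_t) and 'exp_avg_sq' (v_t)
    print(adam.state_dict()["state"][0].keys())  # dict_keys(['step', 'exp_avg', 'exp_avg_sq'])

    sgd = torch.optim.SGD(model.parameters(), lr=0.1)
    sgd.step()
    # plain SGD (momentum=0) keeps no per-parameter history
    print(sgd.state_dict()["state"])  # {}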

So, in my case, since I used Adam, the example code I gave will not resume training correctly?

The correct way is to save the optimizer state with optimizer.state_dict() when saving the checkpoint, and then resume with optimizer.load_state_dict(...).
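
For instance, adapting the example code above (the optimizer checkpoint filename is an assumed naming scheme, not from the original snippet):

    # when saving a checkpoint at the current epoch, save the optimizer state too
    torch.save(generator.state_dict(), "saved_models/%s/generator_%d.pth" % (opt.dataset_name, epoch))
    torch.save(optimizer_G.state_dict(), "saved_models/%s/optimizer_G_%d.pth" % (opt.dataset_name, epoch))

    # when resuming, create the optimizer first, then restore its state
    optimizer_G = torch.optim.Adam(generator.parameters(), lr=opt.lr, betas=(opt.b1, opt.b2))
    if opt.epoch != 0:
        generator.load_state_dict(torch.load("saved_models/%s/generator_%d.pth" % (opt.dataset_name, opt.epoch)))
        optimizer_G.load_state_dict(torch.load("saved_models/%s/optimizer_G_%d.pth" % (opt.dataset_name, opt.epoch)))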


Thanks, I think I understand it now.

Well, it seems that even when I do not load optimizer.state_dict(), the training loss still continues from the last checkpoint.

And I am using Adam, so how can this happen?

This is the log without loading the optimizer state:

    [Epoch 0/200] [Batch 156/6000] [D loss: 0.766469] [G loss: 1.088398, pixel: 0.005941, adv: 0.494318] ETA: 17:56:19.482757

Here is the log with loading the optimizer state:

    [Epoch 0/200] [Batch 187/6000] [D loss: 0.767510] [G loss: 0.981094, pixel: 0.005185, adv: 0.462558] ETA: 18:16:18.970563

How did you know the loss “continues”?
Even if you don’t load the optimizer state, the model parameters are still loaded, so the loss won’t look like training from scratch.
Back to Adam: assume you stop training at step t-1. When resuming, if you load the optimizer state, m and v continue from where you stopped (m_t-1 → m_t).
If you don’t, they are re-initialized to zero (m_t-1 → m_0). The states just differ.
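
A quick way to see this difference (again with a hypothetical toy model) is to compare a restored optimizer with a freshly created one:

    import torch

    model = torch.nn.Linear(4, 2)  # hypothetical toy model
    opt1 = torch.optim.Adam(model.parameters())
    for _ in range(5):
        opt1.zero_grad()
        model(torch.randn(8, 4)).sum().backward()
        opt1.step()

    # load_state_dict carries the moments m_t, v_t and the step counter over
    restored = torch.optim.Adam(model.parameters())
    restored.load_state_dict(opt1.state_dict())
    print(restored.state_dict()["state"][0].keys())  # step, exp_avg, exp_avg_sq

    # a fresh optimizer, as in the example code, starts again from m_0 = v_0 = 0
    fresh = torch.optim.Adam(model.parameters())
    print(fresh.state_dict()["state"])  # {} -- no history yet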

Thanks for your fast reply.

I know the loss “continues” because its value is smaller than when training without loading the model parameters, in which case the log looks something like this, with larger G loss and adv loss:

    [Epoch 0/200] [Batch 156/6000] [D loss: 0.766469] [G loss: 15.088398, pixel: 0.005941, adv: 5.494318] ETA: 17:56:19.482757

So, may I conclude that it is better to load optimizer.state_dict, but will there be a significant difference? Maybe it makes a difference for longer training, but not a huge one?

Personally, I’d say resuming the optimizer state is always recommended.
I can’t tell whether loading the optimizer state will improve performance or not.
But it definitely helps with reproducibility and more rigorous experiments.
