Loading a checkpoint to resume training without loading optimizer.state_dict?

Is loading the model_state_dict and then passing model.parameters() to a freshly created optimizer the same as loading the optimizer's state_dict?

Below is the example code:

    if opt.epoch != 0:
        # Load pretrained models
        generator.load_state_dict(torch.load("saved_models/%s/generator_%d.pth" % (opt.dataset_name, opt.epoch)))
        discriminator.load_state_dict(torch.load("saved_models/%s/discriminator_%d.pth" % (opt.dataset_name, opt.epoch)))
    else:
        # Initialize weights
        generator.apply(weights_init_normal)
        discriminator.apply(weights_init_normal)

    # Optimizers (created fresh here; their internal state is not restored from the checkpoint)
    optimizer_G = torch.optim.Adam(generator.parameters(), lr=opt.lr, betas=(opt.b1, opt.b2))
    optimizer_D = torch.optim.Adam(discriminator.parameters(), lr=opt.lr, betas=(opt.b1, opt.b2))

I think it depends on your optimizer type. For plain SGD (without momentum), there would be no difference. But for optimizers that track internal state, like Adam, you should load the optimizer state from the checkpoint as well.


Thanks for the reply. May I know why Adam needs to load the optimizer state while SGD does not? So, in my case, since I used Adam, the example code I gave will not resume training correctly?

May I know why Adam needs to load the optimizer state while SGD does not?

This is a result of the optimizers' internal design. Adam keeps two running statistics per parameter, m_t and v_t, to store historical gradient information, while plain SGD doesn't. That is also why the Adam optimizer takes more memory than SGD.
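
For example, here is a minimal sketch (using a hypothetical toy model, not the GAN from the question) that inspects both optimizers' state dicts after a single step:

    import torch

    model = torch.nn.Linear(4, 2)  # hypothetical toy model for illustration
    model(torch.randn(8, 4)).sum().backward()

    adam = torch.optim.Adam(model.parameters())
    adam.step()
    # Adam stores per-parameter running moments: 'exp_avg' (m_t) and 'exp_avg_sq' (v_t)
    print(adam.state_dict()["state"][0].keys())  # dict_keys(['step', 'exp_avg', 'exp_avg_sq'])

    sgd = torch.optim.SGD(model.parameters(), lr=0.1)
    sgd.step()
    # plain SGD (momentum=0) keeps no per-parameter history
    print(sgd.state_dict()["state"])  # {}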

So, in my case, since I used Adam, the example code I gave will not resume training correctly?

The correct way is to save the optimizer state with optimizer.state_dict() when saving the checkpoint, and then resume with optimizer.load_state_dict(...).
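
For instance, adapting the example code above (the optimizer checkpoint filename is an assumed naming scheme, not from the original snippet):

    # when saving a checkpoint at the current epoch, save the optimizer state too
    torch.save(generator.state_dict(), "saved_models/%s/generator_%d.pth" % (opt.dataset_name, epoch))
    torch.save(optimizer_G.state_dict(), "saved_models/%s/optimizer_G_%d.pth" % (opt.dataset_name, epoch))

    # when resuming, create the optimizer first, then restore its state
    optimizer_G = torch.optim.Adam(generator.parameters(), lr=opt.lr, betas=(opt.b1, opt.b2))
    if opt.epoch != 0:
        generator.load_state_dict(torch.load("saved_models/%s/generator_%d.pth" % (opt.dataset_name, opt.epoch)))
        optimizer_G.load_state_dict(torch.load("saved_models/%s/optimizer_G_%d.pth" % (opt.dataset_name, opt.epoch)))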


Thanks, I think I understand it now.

Well, it seems that even when I do not load optimizer.state_dict(), the training loss still continues from the last checkpoint.

And I am using Adam, so how can this happen?

This is the log without loading the optimizer state:

    [Epoch 0/200] [Batch 156/6000] [D loss: 0.766469] [G loss: 1.088398, pixel: 0.005941, adv: 0.494318] ETA: 17:56:19.482757

Here is the log with loading the optimizer state:

    [Epoch 0/200] [Batch 187/6000] [D loss: 0.767510] [G loss: 0.981094, pixel: 0.005185, adv: 0.462558] ETA: 18:16:18.970563

How did you know the loss “continues”?
Even if you don’t load the optimizer state, the model parameters are still loaded, so the loss won’t look like training from scratch.
Back to Adam: assume you stop training at step t-1. When resuming, if you load the optimizer state, m and v continue from where you stopped (m_t-1 → m_t).
If you don’t, they are re-initialized to zero (m_t-1 → m_0). The states just differ.
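
A quick way to see this difference (again with a hypothetical toy model) is to compare a restored optimizer with a freshly created one:

    import torch

    model = torch.nn.Linear(4, 2)  # hypothetical toy model
    opt1 = torch.optim.Adam(model.parameters())
    for _ in range(5):
        opt1.zero_grad()
        model(torch.randn(8, 4)).sum().backward()
        opt1.step()

    # load_state_dict carries the moments m_t, v_t and the step counter over
    restored = torch.optim.Adam(model.parameters())
    restored.load_state_dict(opt1.state_dict())
    print(restored.state_dict()["state"][0].keys())  # step, exp_avg, exp_avg_sq

    # a fresh optimizer, as in the example code, starts again from m_0 = v_0 = 0
    fresh = torch.optim.Adam(model.parameters())
    print(fresh.state_dict()["state"])  # {} -- no history yet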

Thanks for your fast reply.

I know the loss “continues” because its value is smaller than when training without loading the model parameters, in which case the log looks something like this, with larger G loss and adv loss:

    [Epoch 0/200] [Batch 156/6000] [D loss: 0.766469] [G loss: 15.088398, pixel: 0.005941, adv: 5.494318] ETA: 17:56:19.482757

So, may I conclude that it is better to load optimizer.state_dict, but will there be a significant difference? Maybe it makes a difference for longer training, but not a huge one?

Personally, I’d say resuming the optimizer state is always recommended.
I can’t tell whether loading the optimizer state will improve performance or not.
But it definitely helps with reproducibility and more rigorous experiments.
