Is loading the model's state_dict and then passing model.parameters() to a freshly created optimizer the same as loading the optimizer's state_dict?
Below is the example code:
if opt.epoch != 0:
    # Load pretrained models
    generator.load_state_dict(torch.load("saved_models/%s/generator_%d.pth" % (opt.dataset_name, opt.epoch)))
    discriminator.load_state_dict(torch.load("saved_models/%s/discriminator_%d.pth" % (opt.dataset_name, opt.epoch)))

# Optimizers (created fresh, without loading any optimizer state)
optimizer_G = torch.optim.Adam(generator.parameters(), lr=opt.lr, betas=(opt.b1, opt.b2))
optimizer_D = torch.optim.Adam(discriminator.parameters(), lr=opt.lr, betas=(opt.b1, opt.b2))
I think it depends on your optimizer type. For plain SGD (without momentum), there would be no difference. But for optimizers that track internal state, like Adam, you should load the optimizer state from the checkpoint as well.
Thanks for the reply. May I know why Adam needs to load the optimizer state while SGD does not? So, in my case, I used Adam; will the example code I gave not resume training correctly?
May I know why Adam needs to load the optimizer state while SGD does not?
This is a result of the optimizer's internal design. Adam tracks two running statistics, m_t and v_t, that store historical gradient information, while plain SGD doesn't. That's also why Adam takes more memory than SGD.
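You can see this state directly. A minimal sketch with a toy Linear model (not the generator from the code above): Adam's state is created lazily, and after one step each parameter gets exp_avg (m_t) and exp_avg_sq (v_t) buffers.

```python
import torch

# Toy stand-in model: one linear layer
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Before any step, Adam's per-parameter state is empty
print(len(optimizer.state))  # 0 -- state is created lazily

# One training step populates the state
model(torch.randn(8, 4)).sum().backward()
optimizer.step()

# Each parameter now carries 'exp_avg' (m_t) and 'exp_avg_sq' (v_t)
for state in optimizer.state.values():
    print(sorted(state.keys()))
```

These buffers are exactly what a fresh optimizer loses if you only reload the model weights.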
So, in my case, I used Adam; will the example code I gave not resume training correctly?
The correct way is to save the optimizer state with optimizer.state_dict() when saving the checkpoint, and then restore it with optimizer.load_state_dict() when resuming.
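A minimal sketch of that save/resume round trip, again with a toy Linear model and a placeholder checkpoint.pth filename (not the names from the code above):

```python
import torch

model = torch.nn.Linear(4, 2)                     # stand-in for the generator
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.5, 0.999))

# One step so Adam's m_t / v_t buffers exist
model(torch.randn(8, 4)).sum().backward()
optimizer.step()

# --- saving: store the model AND optimizer states together ---
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
    "checkpoint.pth",
)

# --- resuming: recreate the objects, then load BOTH state dicts ---
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.5, 0.999))
ckpt = torch.load("checkpoint.pth")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])      # restores exp_avg / exp_avg_sq
```

Saving both dicts in one file keeps the model weights and optimizer moments in sync with each other.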
Thanks, I think I understand it now.
Well, it seems that even when I do not load optimizer.state_dict(), the training loss still continues from the last checkpoint.
And I am using Adam; how can this happen?
This is the log without loading optimizer state:
[Epoch 0/200] [Batch 156/6000] [D loss: 0.766469] [G loss: 1.088398, pixel: 0.005941, adv: 0.494318] ETA: 17:56:19.482757
Here is with loading optimizer state:
[Epoch 0/200] [Batch 187/6000] [D loss: 0.767510] [G loss: 0.981094, pixel: 0.005185, adv: 0.462558] ETA: 18:16:18.970563
How do you know the loss is "continuing"?
Even if you don't load the optimizer state, the model parameters are still loaded, so the loss won't look like training from scratch.
Back to Adam: assume you stop training at step t-1. At resume time, if you load the optimizer states, m_t and v_t pick up where you stopped training (m_t-1 → m_t).
If not, they are re-initialized to zero (m_t-1 → m_0). The states simply differ.
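That difference can be demonstrated directly. A sketch with a toy Linear model: a fresh Adam starts with empty state (its moments restart from zero on the next step), while one that loads the saved state picks up the old m_t and v_t.

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters())
model(torch.randn(8, 4)).sum().backward()
optimizer.step()                       # state is now at m_t, v_t
saved = optimizer.state_dict()

# Fresh optimizer WITHOUT loading: empty state, moments restart from zero
fresh = torch.optim.Adam(model.parameters())
print(len(fresh.state))                # 0

# Fresh optimizer WITH loading: picks up m_t, v_t where training stopped
resumed = torch.optim.Adam(model.parameters())
resumed.load_state_dict(saved)
param = next(model.parameters())
assert torch.equal(resumed.state[param]["exp_avg"],
                   optimizer.state[param]["exp_avg"])
```

Both optimizers see the same model weights, which is why the loss curve looks "continued" either way; only the moment estimates differ.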
Thanks for your fast reply.
I know the loss is "continuing" because its value is smaller than when training without loading the model parameters, which looks like this (the G loss and adv loss are bigger):
[Epoch 0/200] [Batch 156/6000] [D loss: 0.766469] [G loss: 15.088398, pixel: 0.005941, adv: 5.494318] ETA: 17:56:19.482757
So, may I conclude that it is better to load optimizer.state_dict(), but will there be a significant difference? Maybe it matters for longer training, but isn't a huge difference?
Personally, I'd say resuming the optimizer states is always recommended.
I can't tell whether loading the optimizer states would improve the final performance or not.
But it definitely helps with reproducibility and more rigorous experiments.