I think it's up to your optimizer type. For plain SGD there would be no difference, but for optimizers that track internal state, like Adam, you should load the optimizer from the checkpoint as well.
Thanks for the reply. May I know why Adam needs to load the optimizer state while SGD does not? So in my case, since I used Adam, the example code I gave will not resume training correctly?
may I know why Adam needs to load the optimizer state while SGD does not?
This is a result of the optimizers' internal design. Adam keeps two tracking states, m_t and v_t, that store historical gradient information, while SGD doesn't. That's also why Adam takes more memory than SGD.
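For reference, the two tracked states are Adam's exponential moving averages of the gradient and its square, which feed into every parameter update (standard Adam definitions, not something specific to this thread):

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\theta_t &= \theta_{t-1} - \alpha \,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},
\quad \hat{m}_t = \frac{m_t}{1-\beta_1^t},\;\; \hat{v}_t = \frac{v_t}{1-\beta_2^t}
\end{aligned}
```

If m_t and v_t are not restored, the update at the first resumed step uses zeros instead of the accumulated averages.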
So in my case, since I used Adam, the example code I gave will not resume training correctly?
The correct way is to save the optimizer state with optimizer.state_dict() when saving the checkpoint, then resume with optimizer.load_state_dict(...).
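A minimal sketch of that save/resume pattern (the model, file name, and hyperparameters here are illustrative, not from your code):

```python
import torch
import torch.nn as nn

# Tiny illustrative model and Adam optimizer.
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step so Adam has populated its m/v buffers.
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()

# Save model AND optimizer state in one checkpoint.
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
    "checkpoint.pt",
)

# Resume: rebuild the objects, then load both state dicts.
model2 = nn.Linear(4, 2)
optimizer2 = torch.optim.Adam(model2.parameters(), lr=1e-3)
ckpt = torch.load("checkpoint.pt")
model2.load_state_dict(ckpt["model"])
optimizer2.load_state_dict(ckpt["optimizer"])
```

After load_state_dict, optimizer2 carries the same m/v buffers as the original optimizer, so the next step continues exactly where training stopped.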
How did you know the loss is "continuing"?
Even if you don't load the optimizer state, the model parameters are still loaded, so the loss won't look like training from scratch.
Back to Adam: suppose you stop training at step t-1. On resuming, if you load the optimizer state, m and v pick up where you stopped (m_{t-1} → m_t). If you don't, they are re-initialized to m_0 and v_0 (m_{t-1} → m_0). The states simply differ.
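You can observe that difference directly: after a step, Adam's state dict holds the moment buffers, while a freshly constructed Adam over the same (loaded) parameters has an empty state, i.e. it will start from m_0 and v_0. A small sketch (model and names illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# One step populates m_t (exp_avg) and v_t (exp_avg_sq).
model(torch.randn(4, 2)).sum().backward()
opt.step()
state = opt.state_dict()["state"]  # non-empty: holds exp_avg / exp_avg_sq

# A fresh Adam over the same parameters starts with empty state,
# which is what happens when you resume without load_state_dict.
fresh = torch.optim.Adam(model.parameters(), lr=1e-2)
fresh_state = fresh.state_dict()["state"]  # empty: m_0, v_0 start from zero
```

So the model weights can be identical in both runs, yet the first resumed update differs because the moment estimates were reset.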
I know the loss is "continuing" because the value is smaller than when training without loading the model parameters; in that case the G_loss and adv loss are bigger.
So may I conclude that it is better to load optimizer.state_dict(), but that there won't be a significant difference? Maybe it matters for longer training, but the difference isn't huge?
Personally, I'd say resuming the optimizer state is always recommended.
I can't tell whether loading the optimizer state would improve performance or not.
But it definitely helps with reproducibility and more rigorous experiments.