Hi, I’m trying to implement training with checkpoints using the above ideas, so that I can resume training from, say, epoch k and retrain the model from epoch k to N. Suppose I’ve saved the following into the checkpoint file and reloaded it when resuming: the epoch, the model’s state_dict(), and the optimizer. However, I’m not seeing similar training results between the two approaches:
- train the model from epoch 1 to N.
- train the model from epoch 1 to k, save the checkpoint, and resume training from epoch k to N.
I’ve checked that the learning rates are consistent between the two runs; both use SGD with the same momentum and weight decay.
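For reference, here’s roughly the save/resume pattern I’m following (a minimal sketch with a toy model and placeholder names; in my real code the model, data, and hyperparameters differ). I save the optimizer’s state_dict() and the torch RNG state alongside the model:

```python
import torch
import torch.nn as nn

# Toy model and SGD optimizer standing in for the real ones.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

# One training step so the optimizer has momentum buffers to save.
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()

# Save everything needed to resume: note optimizer.state_dict(), not the
# optimizer object itself, plus the RNG state so shuffling/dropout match.
checkpoint = {
    "epoch": 1,  # k, the epoch we stopped at
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
    "torch_rng": torch.get_rng_state(),
}
torch.save(checkpoint, "checkpoint.pt")

# Resume: rebuild the model/optimizer the same way, then restore state.
model2 = nn.Linear(4, 2)
optimizer2 = torch.optim.SGD(model2.parameters(), lr=0.1,
                             momentum=0.9, weight_decay=1e-4)
ckpt = torch.load("checkpoint.pt")
model2.load_state_dict(ckpt["model_state"])
optimizer2.load_state_dict(ckpt["optimizer_state"])
torch.set_rng_state(ckpt["torch_rng"])
start_epoch = ckpt["epoch"]  # continue training with epoch k+1 ... N
```

One thing I’m unsure about is whether restoring the optimizer’s state_dict() (with its momentum buffers) and the RNG states is enough to make the resumed run match the uninterrupted one bit-for-bit.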
Any ideas where I should look?
Thanks!