My training set is truly massive and a single sentence can be very long. An epoch takes so much time to train that I don’t want to save a checkpoint after each epoch. Instead, I want to save a checkpoint after a certain number of steps.
Can I just do that in the normal way?
Yes, you can store the state_dicts whenever you want.
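A minimal sketch of saving inside the loop every few iterations; the tiny model, dummy dataset, and save_every interval are just placeholders for illustration, not part of your setup:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy setup for illustration only; replace with your own model and data.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
loader = DataLoader(dataset, batch_size=8, shuffle=True)

save_every = 4  # hypothetical interval; save a checkpoint every `save_every` steps

for epoch in range(2):
    for step, (data, target) in enumerate(loader):
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()

        # Save a checkpoint every few iterations instead of once per epoch.
        if step % save_every == 0:
            torch.save({
                'epoch': epoch,
                'step': step,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
            }, f'checkpoint_epoch{epoch}_step{step}.pt')
```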
Thanks sir!
But my goal is to resume training from the last checkpoint (a checkpoint saved after a certain number of steps).
With epochs, it’s easy to continue training for several more epochs, but with steps it’s a bit more complex.
Could you please give a snippet?
In case you want to continue from the same iteration, you would need to store the model, optimizer, and learning rate scheduler state_dicts as well as the current epoch and iteration. Assuming you want to get the same training batch, you could iterate the DataLoader in an “empty” loop until the appropriate iteration is reached (you could also seed the code properly so that the same “random” transformations are used, if needed).
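Something along these lines could work as a starting point. It’s only a sketch: the checkpoint keys, the dummy model/data, the StepLR scheduler, and the fixed seed are assumptions you would adapt to your own code:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # seed so shuffling/"random" transforms can be replayed after a restart

# Dummy setup for illustration only; replace with your own model, data, and scheduler.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)
criterion = nn.CrossEntropyLoss()
dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
loader = DataLoader(dataset, batch_size=8, shuffle=True)
num_epochs = 2

# Restore everything that was stored in the step checkpoint
# (assumed keys: model/optimizer/scheduler state_dicts, epoch, step).
ckpt = torch.load('checkpoint.pt')
model.load_state_dict(ckpt['model_state_dict'])
optimizer.load_state_dict(ckpt['optimizer_state_dict'])
scheduler.load_state_dict(ckpt['scheduler_state_dict'])
start_epoch = ckpt['epoch']
start_step = ckpt['step']

for epoch in range(start_epoch, num_epochs):
    for step, (data, target) in enumerate(loader):
        # "Empty" loop: skip batches until the saved iteration of the
        # interrupted epoch is reached, then resume real training.
        if epoch == start_epoch and step <= start_step:
            continue

        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()
    scheduler.step()
```

The skipping only reproduces the same batches if the DataLoader’s shuffling is deterministic, which is why the seed is set before the loader is used.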
Thanks sir!
I got it.