Save a checkpoint every N steps instead of every epoch

My training set is truly massive and each sample is very long, so a single epoch takes a long time to train. I don’t want to save a checkpoint only after each epoch; instead, I want to save one after a certain number of steps.
Can I just do that in the usual way?

Yes, you can store the state_dicts whenever you want.
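For example, here is a minimal sketch of saving inside the training loop every N steps. The names (`save_checkpoint`, `SAVE_EVERY`, `ckpt.pt`) are placeholders, not a fixed API:

```python
import torch

SAVE_EVERY = 1000  # hypothetical interval, in steps


def save_checkpoint(path, model, optimizer, epoch, step, scheduler=None):
    # Bundle everything needed to resume into a single file.
    state = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "epoch": epoch,
        "step": step,
    }
    if scheduler is not None:
        state["scheduler"] = scheduler.state_dict()
    torch.save(state, path)


# Inside the training loop (sketch):
# for step, batch in enumerate(dataloader):
#     ...  # forward, backward, optimizer.step()
#     if step % SAVE_EVERY == 0:
#         save_checkpoint("ckpt.pt", model, optimizer, epoch, step)
```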


Thanks sir!
But my goal is to resume training from the last checkpoint (the checkpoint saved after a certain number of steps).
With epochs, it’s easy to continue training for several more epochs, but with steps it’s a bit more complex.
Could you please share a snippet?

In case you want to continue from the same iteration, you would need to store the model, optimizer, and learning rate scheduler state_dicts as well as the current epoch and iteration. Assuming you want to get the same training batch, you could iterate the DataLoader in an “empty” loop until the appropriate iteration is reached (you could also seed the code properly so that the same “random” transformations are used, if needed).
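The approach above could look roughly like this. It assumes the checkpoint was saved as a dict with `"model"`, `"optimizer"`, `"epoch"`, and `"step"` keys (and optionally `"scheduler"`); the names `resume`, `SEED`, and `ckpt.pt` are placeholders:

```python
import torch


def resume(path, model, optimizer, scheduler=None, map_location="cpu"):
    # Restore all state_dicts plus the epoch/step counters.
    ckpt = torch.load(path, map_location=map_location)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    if scheduler is not None and "scheduler" in ckpt:
        scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["epoch"], ckpt["step"]


# Sketch of fast-forwarding the DataLoader with an "empty" loop so the
# next batch matches the one that would have followed the checkpoint:
# torch.manual_seed(SEED)  # reseed so shuffling/augmentations repeat
# start_epoch, start_step = resume("ckpt.pt", model, optimizer)
# loader_iter = iter(dataloader)
# for _ in range(start_step + 1):
#     next(loader_iter)  # discard batches already seen before the checkpoint
# for step, batch in enumerate(loader_iter, start=start_step + 1):
#     ...  # continue training from the saved step
```

Note that skipping batches this way still pays the data-loading cost for the discarded iterations; it only avoids the forward/backward compute.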


Thanks sir!
I got it.