Save a checkpoint every N steps instead of every epoch

My training set is truly massive and each sample is very long, so a single epoch takes a long time to train. I don’t want to save a checkpoint only after each epoch; instead, I want to save one after a certain number of steps.
Can I just do that in the usual way?

Yes, you can store the state_dicts whenever you want.
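For example, here is a minimal sketch of saving inside the training loop every N steps. The names (`save_checkpoint`, `SAVE_EVERY`, `ckpt.pt`) are placeholders, not a fixed API:

```python
import torch

SAVE_EVERY = 1000  # hypothetical interval, in steps


def save_checkpoint(path, model, optimizer, epoch, step, scheduler=None):
    # Bundle everything needed to resume into a single file.
    state = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "epoch": epoch,
        "step": step,
    }
    if scheduler is not None:
        state["scheduler"] = scheduler.state_dict()
    torch.save(state, path)


# Inside the training loop (sketch):
# for step, batch in enumerate(dataloader):
#     ...  # forward, backward, optimizer.step()
#     if step % SAVE_EVERY == 0:
#         save_checkpoint("ckpt.pt", model, optimizer, epoch, step)
```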


Thanks sir!
But my goal is to resume training from the last checkpoint (the checkpoint saved after a certain number of steps).
With epochs, it’s easy to continue training for several more epochs, but with steps it’s a bit more complex.
Could you please share a snippet?

In case you want to continue from the same iteration, you would need to store the model, optimizer, and learning rate scheduler state_dicts as well as the current epoch and iteration. Assuming you want to get the same training batch, you could iterate the DataLoader in an “empty” loop until the appropriate iteration is reached (you could also seed the code properly so that the same “random” transformations are used, if needed).
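The approach above could look roughly like this. It assumes the checkpoint was saved as a dict with `"model"`, `"optimizer"`, `"epoch"`, and `"step"` keys (and optionally `"scheduler"`); the names `resume`, `SEED`, and `ckpt.pt` are placeholders:

```python
import torch


def resume(path, model, optimizer, scheduler=None, map_location="cpu"):
    # Restore all state_dicts plus the epoch/step counters.
    ckpt = torch.load(path, map_location=map_location)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    if scheduler is not None and "scheduler" in ckpt:
        scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["epoch"], ckpt["step"]


# Sketch of fast-forwarding the DataLoader with an "empty" loop so the
# next batch matches the one that would have followed the checkpoint:
# torch.manual_seed(SEED)  # reseed so shuffling/augmentations repeat
# start_epoch, start_step = resume("ckpt.pt", model, optimizer)
# loader_iter = iter(dataloader)
# for _ in range(start_step + 1):
#     next(loader_iter)  # discard batches already seen before the checkpoint
# for step, batch in enumerate(loader_iter, start=start_step + 1):
#     ...  # continue training from the saved step
```

Note that skipping batches this way still pays the data-loading cost for the discarded iterations; it only avoids the forward/backward compute.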


Thanks sir!
I got it.