Suggestions requested for saving and loading model snapshots/checkpoints

prabu-github · October 11, 2017, 4:02am

I have access to a GPU cluster that provides 20-minute slots.
I would like to run training for as many mini-batches (not epochs) as possible within those 20 minutes and then save the state. In the next 20-minute slot, I would like to resume training from the saved state. Could you please provide some pointers or an example of how to do this? I am a beginner in PyTorch, so apologies if this has been discussed elsewhere.

smth · October 11, 2017, 4:32am

You can look at our examples repository to see how to save and load models. The imagenet example may serve the purpose: https://github.com/pytorch/examples/
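
For the time-slot case specifically, here is a minimal sketch of the save/resume pattern. It is not taken from the examples repository; the file name `checkpoint.pth`, the helpers `save_checkpoint`, `load_checkpoint` and `train_for`, the `next_batch` callback, and the MSE loss are all placeholder assumptions to adapt to your own setup. The idea is to store the mini-batch counter together with the model and optimizer `state_dict`s, and to stop on a wall-clock deadline rather than on an epoch boundary.

```python
import os
import time

import torch
import torch.nn as nn
import torch.optim as optim

CKPT_PATH = "checkpoint.pth"  # hypothetical file name, pick your own


def save_checkpoint(path, model, optimizer, step):
    # Write to a temporary file and rename it afterwards, so that a slot
    # ending in the middle of a write does not leave a truncated checkpoint.
    tmp_path = path + ".tmp"
    torch.save(
        {
            "step": step,  # number of mini-batches completed so far
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        tmp_path,
    )
    os.replace(tmp_path, path)


def load_checkpoint(path, model, optimizer):
    # Restore model and optimizer in place; return the number of
    # mini-batches already completed (0 when no checkpoint exists yet).
    if not os.path.exists(path):
        return 0
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]


def train_for(budget_seconds, model, optimizer, next_batch,
              path=CKPT_PATH, save_every=50):
    # Resume from the last checkpoint (if any) and train until the time
    # budget is used up. `next_batch(step)` is a hypothetical callback
    # that returns an (inputs, targets) pair for the given step.
    step = load_checkpoint(path, model, optimizer)
    deadline = time.monotonic() + budget_seconds
    loss_fn = nn.MSELoss()  # assumption: a regression loss, swap in your own
    while time.monotonic() < deadline:
        inputs, targets = next_batch(step)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        step += 1
        if step % save_every == 0:
            # Periodic save, in case the slot is cut off unexpectedly.
            save_checkpoint(path, model, optimizer, step)
    save_checkpoint(path, model, optimizer, step)  # final save for this slot
    return step
```

In each slot you would call something like `train_for(19 * 60, model, optimizer, next_batch)`, keeping the budget a little under 20 minutes so the final save has time to finish. This sketch does not save the data-loader position or random-number-generator state, so the order of mini-batches after a resume will not match an uninterrupted run unless `next_batch` derives it from `step`.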

prabu-github · October 11, 2017, 3:28pm

Thanks for the pointer @smth and the prompt reply. I will check it out.