Can we resume training a model after stopping it, as if we had never stopped it?

jmlipman · October 28, 2022, 4:25pm

I originally assumed that, to resume training where it stopped, we just needed to store the weights, the optimizer state, and the current epoch/iteration, and to re-adjust the learning rate if any scheduling was used. However, after resuming, the samples are drawn from the training set in a different order, regardless of the seed you use.

This happens because a new random seed is drawn every time “iter(tr_loader)” is called, which I guess is an efficient way to reshuffle the data every epoch. Is there a good way to continue sampling the data in the order it was supposed to follow, other than calling Tensor.random_ n times, where n is the number of times “iter(tr_loader)” and “iter(val_loader)” were previously called?
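One option that might avoid counting calls is to store the random number generator state in the checkpoint and restore it when resuming. The following is only a minimal sketch, assuming the loaders use num_workers=0 and that shuffling relies solely on PyTorch's global CPU RNG. The toy dataset, the checkpoint dictionary and the epoch_order helper are made up for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the real training set (hypothetical data).
dataset = TensorDataset(torch.arange(100))
tr_loader = DataLoader(dataset, batch_size=10, shuffle=True)  # num_workers=0


def epoch_order(loader):
    """Return the sample values in the order the loader yields them."""
    return [int(x) for (batch,) in loader for x in batch]


torch.manual_seed(0)
epoch_order(tr_loader)  # epoch 1 runs as usual

# "Checkpoint": alongside weights and optimizer state, keep the RNG state.
checkpoint = {"rng_state": torch.get_rng_state()}

order_uninterrupted = epoch_order(tr_loader)  # epoch 2 without stopping

# "Resume": restore the RNG state before iterating over the loader again.
torch.set_rng_state(checkpoint["rng_state"])
order_resumed = epoch_order(tr_loader)  # epoch 2 after resuming

print(order_uninterrupted == order_resumed)
```

If the dataset or its transforms also draw from NumPy's or Python's random module, or from the CUDA generators, those states would presumably need to be saved and restored in the same way.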

You can read more about this it’s-not-a-bug-it’s-a-feature issue here: Order of the elements in the queue changes unexpectedly · Issue #991 · fepegar/torchio · GitHub

Long story short, this:

# enumerate() calls iter() on the loader, which draws a new random seed
_ = enumerate(val_loader)
_ = enumerate(val_loader)
for i, patch in enumerate(tr_loader):
    print(patch["info"])

will draw the samples in a different order than this:

for i, patch in enumerate(tr_loader):
    print(patch["info"])

because the random state that Tensor.random_() draws from in the for loop is no longer the same: the two earlier enumerate(val_loader) calls have already advanced it.
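A possible way to decouple the two loaders is to give the training loader its own torch.Generator through the DataLoader's generator argument, so that calls on the validation loader no longer touch the random state used for shuffling. This is a sketch under the same assumptions as the snippets above (num_workers=0); the toy dataset and the make_loaders and epoch_order helpers are hypothetical.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(100))  # hypothetical toy data


def make_loaders(seed):
    """Build loaders; the training loader gets its own seeded generator."""
    g = torch.Generator()
    g.manual_seed(seed)
    tr_loader = DataLoader(dataset, batch_size=10, shuffle=True, generator=g)
    val_loader = DataLoader(dataset, batch_size=10, shuffle=False)
    return tr_loader, val_loader, g


def epoch_order(loader):
    """Return the sample values in the order the loader yields them."""
    return [int(x) for (batch,) in loader for x in batch]


# Case 1: iterate over the training loader directly.
tr_loader, val_loader, _ = make_loaders(seed=0)
order_direct = epoch_order(tr_loader)

# Case 2: touch the validation loader first, as in the snippet above.
tr_loader, val_loader, g = make_loaders(seed=0)
_ = enumerate(val_loader)
_ = enumerate(val_loader)
order_after_val = epoch_order(tr_loader)

print(order_direct == order_after_val)

# The generator state can go into the checkpoint and be restored later.
state = g.get_state()
next_epoch = epoch_order(tr_loader)
g.set_state(state)
next_epoch_resumed = epoch_order(tr_loader)
print(next_epoch == next_epoch_resumed)
```

With a dedicated generator, only its state (g.get_state()) needs to be stored to reproduce the training order, however many times the validation loader was iterated in between.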