Resuming from model checkpoints

I want to resume training from model checkpoints, but the result on the development set of a model resumed from a checkpoint differs from that of a model trained from scratch at the same epoch. For example, when training from scratch, the accuracy on the development set at the third epoch is 91.41666666666667; however, when training resumes from the checkpoint saved at the end of the second epoch, the accuracy at the third epoch is 90.675. I save the model using the following code:

checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'best_acc': best_acc
}
torch.save(checkpoint, checkpoint_dir + os.sep + 'last_epoch.pt')

and I restore the model using the following code:

checkpoint = torch.load(checkpoint_dir + os.sep + 'last_epoch.pt')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
initepoch = checkpoint['epoch'] + 1
best_acc = checkpoint['best_acc']

I think the reason for the difference is the DataLoader RNG state. How can I solve it?

Did you seed the code properly, and is this drop in accuracy reproducible?
I.e., do you always get these numbers if you rerun the entire code and retrain just the last epoch?

If you didn’t use any seeds (and are using e.g. torch.backends.cudnn.benchmark = True), I would assume your final accuracy might show some variance.
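
For reference, a minimal fully-seeded setup could look like the sketch below; the seed_everything helper name, the torch.cuda.manual_seed_all call, and the cuDNN flags are additions beyond what is shown in this thread, included only to illustrate one deterministic configuration:

import random
import numpy as np
import torch

def seed_everything(seed_num=123):
    # hypothetical helper: seed every RNG that can influence training
    random.seed(seed_num)
    np.random.seed(seed_num)
    torch.manual_seed(seed_num)
    torch.cuda.manual_seed_all(seed_num)  # safe to call even without a GPU
    # trade speed for reproducibility in cuDNN kernel selection
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True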

I did use a seed; the code is as follows:

import random
import numpy as np
import torch

seed_num = 123
random.seed(seed_num)
torch.manual_seed(seed_num)
np.random.seed(seed_num)

If I run the code from scratch twice, the accuracy on the development set at the same epoch is identical.
Resuming from the checkpoint changes the data shuffle order: the first epoch trained after loading the checkpoint from the end of the second epoch shuffles the data differently from the third epoch of a from-scratch run. I suspect this is why models at the same epoch score differently on the development set. How can I make the data shuffling when training from a checkpoint the same as when training from scratch?
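
One way to get matching shuffles (a sketch, not from the thread, assuming the data goes through a torch.utils.data.DataLoader with shuffle=True on a PyTorch version that accepts a generator argument; the dataset, batch size, and epoch counts below are stand-ins) is to drive the shuffle with a generator re-seeded from the epoch number, so epoch N draws the same permutation whether the run started from scratch or from a checkpoint:

import torch
from torch.utils.data import DataLoader, TensorDataset

seed_num = 123       # same seed as the rest of the script
num_epochs = 10      # hypothetical value
initepoch = 0        # set to checkpoint['epoch'] + 1 when resuming
train_dataset = TensorDataset(torch.randn(100, 8))  # stand-in for the real training set

for epoch in range(initepoch, num_epochs):
    # Re-seed the shuffle from the epoch number, so the permutation for
    # epoch N is identical in fresh and resumed runs.
    g = torch.Generator()
    g.manual_seed(seed_num + epoch)
    train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True, generator=g)
    for (batch,) in train_loader:
        pass  # forward / backward / optimizer step as in the original loop

An alternative would be to also store torch.get_rng_state(), np.random.get_state(), and random.getstate() in the checkpoint and restore them on resume, so the global RNG streams continue exactly where they left off.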