I want to resume training from model checkpoints, but the model resumed from a checkpoint gives a different result on the development set than the model trained from scratch at the same epoch. For example, training from scratch reaches an accuracy of 91.41666666666667 on the development set at the third epoch, but resuming from the checkpoint saved after the second epoch reaches only 90.675 at the third epoch. I save the model using the following code:
checkpoint = {
'epoch': epoch,
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'best_acc': best_acc
}
torch.save(checkpoint, os.path.join(checkpoint_dir, 'last_epoch.pt'))
and I restore the model using the following code:
checkpoint = torch.load(os.path.join(checkpoint_dir, 'last_epoch.pt'))
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
initepoch = checkpoint['epoch'] + 1
best_acc = checkpoint['best_acc']
I think the cause of the difference is the DataLoader's RNG state (the shuffling order after resuming is not the same as it would have been in an uninterrupted run). How can I solve this?
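If the RNG state is indeed the cause, one common approach is to snapshot every RNG stream (Python's `random`, NumPy, and PyTorch CPU/CUDA) into the checkpoint dict alongside `model_state_dict`, and restore them on resume. The sketch below is illustrative, not taken from the question's code; the helper names `rng_state_dict` and `load_rng_state_dict` are my own, and the final demo only verifies that restoring a snapshot reproduces the same subsequent draws:

```python
import random
import numpy as np
import torch

def rng_state_dict():
    """Snapshot every RNG stream that can affect batch order or augmentation."""
    state = {
        'python_rng': random.getstate(),
        'numpy_rng': np.random.get_state(),
        'torch_rng': torch.get_rng_state(),
    }
    if torch.cuda.is_available():
        # One state per visible GPU.
        state['cuda_rng'] = torch.cuda.get_rng_state_all()
    return state

def load_rng_state_dict(state):
    """Restore the streams captured by rng_state_dict()."""
    random.setstate(state['python_rng'])
    np.random.set_state(state['numpy_rng'])
    torch.set_rng_state(state['torch_rng'])
    if torch.cuda.is_available() and 'cuda_rng' in state:
        torch.cuda.set_rng_state_all(state['cuda_rng'])

# Demo: after restoring a snapshot, the next draws match exactly.
snapshot = rng_state_dict()
expected = torch.rand(3)          # draws that an uninterrupted run would see
load_rng_state_dict(snapshot)     # rewind to the snapshot
resumed = torch.rand(3)           # draws after "resuming"
assert torch.equal(expected, resumed)
```

You would add `'rng_state': rng_state_dict()` to the checkpoint dict before `torch.save`, and call `load_rng_state_dict(checkpoint['rng_state'])` right after loading, before the DataLoader for the next epoch is iterated. Note that snapshotting the RNG only makes the resumed run match a run that was deterministic to begin with; if you also need run-to-run determinism, seed everything at startup and consider `torch.use_deterministic_algorithms(True)`, and be aware that DataLoader workers (`num_workers > 0`) each hold their own RNG state.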