Loaded model has higher loss compared to the model before saving

I’ve developed a basic transformer model for a translation task. The model itself works well: the training and validation losses decrease and the BLEU score increases as the epochs go on. My problem is with loading the model. When I load the model after an epoch, the loaded model’s loss is higher than it was before saving. For instance, when epoch 20 ended I got loss = 2.03, but if I save the model and load it to start from epoch 21, the loaded model has loss = 2.5.

The interesting thing is that if I set a seed when I load my dataset, the model behaves correctly after loading, but the validation performance is much lower when I do not set a seed.
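
For reference, by "setting a seed" I mean something along these lines (a rough sketch; the exact calls and call sites in my script may differ):

import random
import numpy as np
import torch

def set_seed(seed=42):
    # seed all RNGs that can affect dataset shuffling and splitting
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)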

Here are my functions to save and load the model:

def save_checkpoint(self, checkpoint_path, loss, epoch, type='regular'):
    # save model, optimizer, and scheduler states together with the current loss and epoch
    torch.save({
        'model': self.model.state_dict(),
        'optimizer': self.optimizer.state_dict(),
        'scheduler': self.scheduler.state_dict(),
        'loss': loss,
        'epoch': epoch
    }, checkpoint_path)

def load_checkpoint(self, checkpoint_path):
    # restore model, optimizer, and scheduler states and return the saved loss
    # and the epoch to resume from; implicitly returns None if the file is missing
    if os.path.exists(checkpoint_path):
        checkpoint = torch.load(checkpoint_path)
        self.model.load_state_dict(checkpoint['model'])
        self.optimizer.load_state_dict(checkpoint['optimizer'])
        self.scheduler.load_state_dict(checkpoint['scheduler'])
        return checkpoint['loss'], checkpoint['epoch'] + 1

Many thanks in advance for your time and efforts.

Compare the outputs of your model in eval mode using a static input tensor before and after reloading the state_dict. If these outputs are equal, the state_dict loading works fine and the issue comes from another part of your training script.
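
A minimal sketch of that check could look like this (the `trainer` object, the checkpoint path, and the input shapes are placeholders for your own setup):

# assumes `trainer` holds the model and the save/load methods shown above
model = trainer.model
model.eval()

# a fixed, static input; adjust shapes/dtypes to what your transformer expects
src = torch.randint(0, 100, (1, 16))
tgt = torch.randint(0, 100, (1, 16))

with torch.no_grad():
    out_before = model(src, tgt)

trainer.save_checkpoint('tmp_check.pt', loss=0.0, epoch=0)
trainer.load_checkpoint('tmp_check.pt')
model.eval()

with torch.no_grad():
    out_after = model(src, tgt)

# if loading works, the outputs should match exactly
print(torch.equal(out_before, out_after))
print((out_before - out_after).abs().max())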

I’ve tested your idea but the inconsistency is still there. Do you have any suggestions regarding the data loader?

If the outputs in eval mode for the same static input are different, the data loading pipeline is not related and you would need to look into the model.
You could print the intermediate activations directly in the forward of the model, or you could use forward hooks to check them. Computing tensor.abs().sum() might be a good start, as it should indicate which layer causes the different results.
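
A minimal sketch of the forward-hook approach, reusing the model and static inputs from the check above (names are placeholders):

import torch

def make_hook(name):
    # print the summed absolute activation of each layer so runs can be compared
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor):
            print(f'{name}: {output.abs().sum().item():.6f}')
    return hook

handles = [m.register_forward_hook(make_hook(name))
           for name, m in model.named_modules() if name]

model.eval()
with torch.no_grad():
    model(src, tgt)  # run the same static input before and after loading the checkpoint

# remove the hooks when done
for h in handles:
    h.remove()

Running this once before saving and once after loading, then diffing the printed values, should show the first layer whose activations diverge.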