When I resume training, the accuracy changes drastically. What should I check?


I save checkpoints using `state_dict`, like below:

    import os
    import torch

    def save_checkpoint(states, is_best, output_dir,
                        filename='checkpoint.pth.tar'):
        torch.save(states, os.path.join(output_dir, filename))
        if is_best and 'state_dict' in states:
            torch.save(states['state_dict'],
                       os.path.join(output_dir, 'model_best.pth.tar'))

    save_checkpoint({
        'epoch': epoch + 1,
        'model': get_model_name(config),
        'state_dict': model.state_dict(),
        'perf': perf_indicator,
        'optimizer': optimizer.state_dict(),
    }, best_model, final_output_dir)
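For context, here is a minimal sketch of the matching resume path. The function name `load_checkpoint` and the `map_location` choice are my own for illustration; they are not part of the original code:

```python
import torch

def load_checkpoint(model, optimizer, checkpoint_file):
    # Load the dict that save_checkpoint wrote with torch.save
    checkpoint = torch.load(checkpoint_file, map_location='cpu')
    # Restore the model weights
    model.load_state_dict(checkpoint['state_dict'])
    # Restoring the optimizer state is essential: it holds momentum
    # buffers / Adam moments and the current learning rate per param group
    optimizer.load_state_dict(checkpoint['optimizer'])
    # 'epoch' was saved as epoch + 1, so it is the epoch to resume from
    return checkpoint['epoch']
```

If either the model or the optimizer state is not restored this way, the first epochs after resuming will typically show a drop like the one described below.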

Here is an example of my situation. If I stop training at epoch 150 and resume from epoch 151:

  • accuracy at epoch 150 is 90% (the final accuracy before resuming),
    but accuracy at epoch 151 is 80%…

There is a big difference in the accuracy before and after resuming training.

Why is this happening and what should I check?

Do you have a learning rate scheduler? If so, you also need to save its state and restore it properly when resuming; otherwise the scheduler restarts from its initial learning rate. And of course, the other components (model, optimizer) need to be reloaded correctly as well…