When I resume training, there is a big difference in accuracy. What should I check?

Hi.

I save checkpoints using state_dict, as shown below:

import os
import torch

def save_checkpoint(states, is_best, output_dir,
                    filename='checkpoint.pth.tar'):
    # Save the full training state (epoch, model, optimizer, ...).
    torch.save(states, os.path.join(output_dir, filename))
    # Also keep a separate copy of the best-performing weights.
    if is_best and 'state_dict' in states:
        torch.save(states['state_dict'],
                   os.path.join(output_dir, 'model_best.pth.tar'))

save_checkpoint({
    'epoch': epoch + 1,
    'model': get_model_name(config),
    'state_dict': model.state_dict(),
    'perf': perf_indicator,
    'optimizer': optimizer.state_dict(),
}, best_model, final_output_dir)

Here is an example of my situation.
If I stop training at epoch 150 and resume from epoch 151:

  • the accuracy at epoch 150 (the final accuracy before resuming) is 90%,
    but the accuracy at epoch 151 is only 80%.

There is a big difference in the accuracy before and after resuming training.

Why is this happening and what should I check?

Do you have a learning rate scheduler? If so, you also need to save its state and restore it properly when resuming. Of course, the other elements (model and optimizer) need to be reloaded properly as well…
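
For example, schedulers from torch.optim.lr_scheduler have their own state_dict that must be checkpointed and restored, just like the model and optimizer. Here is a minimal sketch; the StepLR scheduler and all variable names are placeholders for illustration, not taken from your code:

import torch
import torch.nn as nn

# Hypothetical setup so the sketch runs standalone; substitute your own
# model, optimizer, and scheduler.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)
epoch = 150  # placeholder for the last completed epoch

# When saving, include the scheduler state alongside the other entries:
torch.save({
    'epoch': epoch + 1,
    'state_dict': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'scheduler': lr_scheduler.state_dict(),
}, 'checkpoint.pth.tar')

# When resuming, restore all three states before the training loop continues:
checkpoint = torch.load('checkpoint.pth.tar')
model.load_state_dict(checkpoint['state_dict'])
optimizer.load_state_dict(checkpoint['optimizer'])
lr_scheduler.load_state_dict(checkpoint['scheduler'])
begin_epoch = checkpoint['epoch']  # continue training from this epoch

If only the model weights are restored, a freshly constructed scheduler will start over from the initial learning rate, and a fresh optimizer will lose its momentum/Adam statistics; either of those can explain a sudden drop from 90% to 80% right after resuming.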