Why doesn't resuming work properly in PyTorch?

Hi, I can't resume properly from a checkpoint. Each time I try to resume, it seems the training statistics are either invalid or missing, since the accuracy gets very bad!
For instance, I save a checkpoint at epoch 80 with 62.5% accuracy, but when I resume from that very checkpoint, the accuracy drops to 34%!
What am I doing wrong here? Here is the snippet for saving and resuming:

    # optionally resume from a checkpoint
    if args.resume:
        if os.path.isfile(args.resume):
            print_log("=> loading checkpoint '{}'".format(args.resume), log)
            checkpoint = torch.load(args.resume)
            args.start_epoch = checkpoint['epoch']
            best_prec1 = checkpoint['best_prec1']
            model.load_state_dict(checkpoint['state_dict'])
            optimizer.load_state_dict(checkpoint['optimizer'])
            print_log("=> loaded checkpoint '{}' (epoch {})".format(args.resume, checkpoint['epoch']), log)
        else:
            print_log("=> no checkpoint found at '{}'".format(args.resume), log)

    # inside the training loop, save a checkpoint every epoch
    save_checkpoint({
        'epoch': epoch + 1,
        'arch': args.arch,
        'state_dict': model.state_dict(),
        'best_prec1': best_prec1,
        'optimizer': optimizer.state_dict(),
    }, is_best, filename, bestname)
    # measure elapsed time
    epoch_time.update(time.time() - start_time)
    start_time = time.time()

    def save_checkpoint(state, is_best, filename, bestname):
        torch.save(state, filename)
        if is_best:
            shutil.copyfile(filename, bestname)

Any help is greatly appreciated

The code snippet looks fine to me.
Could you provide some information regarding the training? I assume you are observing the training loss and accuracy, saving the model, and after resuming you see a higher training loss.

I created a small example some time ago and could not reproduce the issue (from another thread).

Probably we would have to inspect other parts of your code to see if something goes wrong.
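A sanity check along those lines might look like this (a toy sketch, not the original example; the model, optimizer, and data below are made up): train briefly, checkpoint, reload into fresh objects, and verify the loss is identical.

```python
# Toy save/resume sanity check (all names made up, not from main.py):
# checkpoint to an in-memory buffer, reload into fresh objects, and
# verify the loss on a fixed batch matches exactly.
import io
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))

for _ in range(5):  # a few training steps
    optimizer.zero_grad()
    criterion(model(x), y).backward()
    optimizer.step()

buf = io.BytesIO()
torch.save({'state_dict': model.state_dict(),
            'optimizer': optimizer.state_dict()}, buf)
loss_before = criterion(model(x), y).item()

# "resume" into fresh model/optimizer instances
buf.seek(0)
ckpt = torch.load(buf)
model2 = nn.Linear(10, 2)
optimizer2 = torch.optim.SGD(model2.parameters(), lr=0.1)
model2.load_state_dict(ckpt['state_dict'])
optimizer2.load_state_dict(ckpt['optimizer'])
loss_after = criterion(model2(x), y).item()

print(abs(loss_before - loss_after) < 1e-6)  # True: resume is exact
```

If the two losses diverge here, something other than the checkpointing logic itself is to blame.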

Hi, thank you very much. Here is the link to the full source code: main.py
In the meantime, after posting the question, it occurred to me that it might have something to do with the BatchNorm statistics not being properly loaded. I'm not sure if this is the case, so how can I check for that?
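One way to check might be to compare the BatchNorm `running_mean`/`running_var` buffers before saving and after loading; they are part of `state_dict()`, so they should match exactly. A minimal sketch (the model here is made up):

```python
# Check that BatchNorm running stats survive a state_dict round trip:
# run a forward pass in train mode so the stats are updated, load the
# state_dict into a fresh model, and compare the buffers.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))
model.train()
model(torch.randn(4, 3, 16, 16))  # updates running_mean / running_var

fresh = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))
fresh.load_state_dict(model.state_dict())

bn_saved, bn_loaded = model[1], fresh[1]
print(torch.equal(bn_saved.running_mean, bn_loaded.running_mean))  # True
print(torch.equal(bn_saved.running_var, bn_loaded.running_var))    # True
```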

I couldn’t find any obvious mistakes, but could it be related to the AverageMeter, which is restarted?
Could you train for a few epochs, reset the AverageMeter and have a look at your loss?

It's not the AverageMeter, since the accuracy drops hugely as well.
I found the issue: the scheduler needed to be saved and restored upon resuming as well.
It is explained here
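For completeness, a sketch of saving and restoring the scheduler state (the `StepLR` hyperparameters below are illustrative, not taken from the original code):

```python
# Persist scheduler.state_dict() alongside model and optimizer, and
# restore all three when resuming. Otherwise the LR schedule silently
# restarts from epoch 0. Hyperparameters here are made up.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(80):
    # ... train one epoch ...
    scheduler.step()

checkpoint = {
    'state_dict': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'scheduler': scheduler.state_dict(),  # the piece that was missing
}

# on resume: rebuild the objects, then restore all three states
model2 = nn.Linear(10, 2)
optimizer2 = torch.optim.SGD(model2.parameters(), lr=0.1)
scheduler2 = torch.optim.lr_scheduler.StepLR(optimizer2, step_size=30, gamma=0.1)
model2.load_state_dict(checkpoint['state_dict'])
optimizer2.load_state_dict(checkpoint['optimizer'])
scheduler2.load_state_dict(checkpoint['scheduler'])

# lr was decayed at epochs 30 and 60, so it is ~0.001, not the initial 0.1
print(optimizer2.param_groups[0]['lr'])
```

Without the `scheduler` entry, a resumed run would start back at the initial learning rate, which matches the symptom of accuracy collapsing after resume.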