Saving and restoring model weights at the mini-batch level?

campellcl · August 10, 2018, 11:34pm

Thank you for your quick response. So the ImageNet Example on GitHub is misusing checkpoints? See:

PyTorch Discussion on Saving and Restoring Training and the related:
- ImageNet PyTorch GitHub Example: Save Checkpoint of Best Performing Epoch
- ImageNet PyTorch GitHub Example: Restoring Training Progress from Checkpoint

What I am really trying to ask is, can I do what is shown here:

 save_checkpoint({
            'epoch': epoch + 1,
            'arch': args.arch,
            'state_dict': model.state_dict(),
            'best_prec1': best_prec1,
            'optimizer' : optimizer.state_dict(),
        }, is_best)

with:

def save_checkpoint(state, is_best, filename='checkpoint.pth.tar'):
    torch.save(state, filename)
    if is_best:
        shutil.copyfile(filename, 'model_best.pth.tar')

and resuming from the checkpoint:

 if args.resume:
        if os.path.isfile(args.resume):
            print("=> loading checkpoint '{}'".format(args.resume))
            checkpoint = torch.load(args.resume)
            args.start_epoch = checkpoint['epoch']
            best_prec1 = checkpoint['best_prec1']
            model.load_state_dict(checkpoint['state_dict'])
            optimizer.load_state_dict(checkpoint['optimizer'])
            print("=> loaded checkpoint '{}' (epoch {})"
                  .format(args.resume, checkpoint['epoch']))
        else:
            print("=> no checkpoint found at '{}'".format(args.resume))

But with the mini-batch in addition to the epoch? If this isn’t the correct use of a checkpoint, what method of serialization should I use to save and restore the state (gradients, weights, epoch_num, minibatch_num) during training?