Thank you for your quick response. So is the ImageNet example on GitHub misusing checkpoints? See:
- the PyTorch discussion on saving and restoring training (and the related threads).
What I am really asking is whether I can do what is shown here:
```python
save_checkpoint({
    'epoch': epoch + 1,
    'arch': args.arch,
    'state_dict': model.state_dict(),
    'best_prec1': best_prec1,
    'optimizer': optimizer.state_dict(),
}, is_best)
```
with:
```python
import shutil
import torch

def save_checkpoint(state, is_best, filename='checkpoint.pth.tar'):
    torch.save(state, filename)
    if is_best:
        shutil.copyfile(filename, 'model_best.pth.tar')
```
and resuming from the checkpoint:
```python
if args.resume:
    if os.path.isfile(args.resume):
        print("=> loading checkpoint '{}'".format(args.resume))
        checkpoint = torch.load(args.resume)
        args.start_epoch = checkpoint['epoch']
        best_prec1 = checkpoint['best_prec1']
        model.load_state_dict(checkpoint['state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer'])
        print("=> loaded checkpoint '{}' (epoch {})"
              .format(args.resume, checkpoint['epoch']))
    else:
        print("=> no checkpoint found at '{}'".format(args.resume))
```
But with the mini-batch index saved in addition to the epoch? If this isn't the correct use of a checkpoint, what serialization method should I use to save and restore the training state (gradients, weights, epoch_num, minibatch_num) during training?
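Concretely, here is a minimal sketch of what I have in mind. This is hypothetical, not from the ImageNet example: the `'minibatch'` key and the `start_minibatch` name are my own, and I assume `i` is the batch index inside the training loop.

```python
# Saving: the checkpoint is an ordinary dict, so I would just add a
# (hypothetical) 'minibatch' entry next to 'epoch'.
save_checkpoint({
    'epoch': epoch,
    'minibatch': i + 1,  # hypothetical: index of the next batch to process
    'arch': args.arch,
    'state_dict': model.state_dict(),
    'best_prec1': best_prec1,
    'optimizer': optimizer.state_dict(),
}, is_best)

# Resuming: read the counter back alongside the epoch.
checkpoint = torch.load(args.resume)
args.start_epoch = checkpoint['epoch']
start_minibatch = checkpoint['minibatch']  # hypothetical counter
best_prec1 = checkpoint['best_prec1']
model.load_state_dict(checkpoint['state_dict'])
optimizer.load_state_dict(checkpoint['optimizer'])
```

The training loop would then skip the first `start_minibatch` batches of the resumed epoch. Is that a reasonable approach, or is there a better-supported way to serialize mid-epoch state?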