I’m trying to resume training, and I am using torch.optim.lr_scheduler.MultiStepLR to decrease the learning rate. I noticed the constructor accepts a last_epoch parameter, so I tried setting it to the epoch at which my checkpoint was saved and simply resuming training from that point forward.
When I tried to pass a value for this parameter, I got the error:
KeyError: "param 'initial_lr' is not specified in param_groups[0] when resuming an optimizer"
I have no idea what this means or how to get around it. The documentation is also vague to me and I really can’t understand it: it talks about initial_lr, but there is no parameter with that name. I’m completely lost here!
Any help is greatly appreciated.
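For anyone who hits the same thing: the KeyError comes from the scheduler expecting an 'initial_lr' key in every one of the optimizer’s param groups whenever last_epoch is not -1. Below is a minimal sketch of one workaround: seed that key manually before constructing the scheduler (the model, learning rate, and milestones here are placeholders, not the real training setup).

```python
import torch
from torch import nn, optim
from torch.optim import lr_scheduler

# Placeholder model/optimizer standing in for the real training setup.
model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# The scheduler only writes 'initial_lr' into each param group itself
# when last_epoch == -1. When resuming with an explicit last_epoch it
# expects the key to already be present, so seed it with the base
# learning rate before building the scheduler:
for group in optimizer.param_groups:
    group.setdefault('initial_lr', group['lr'])

scheduler = lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90],
                                     gamma=0.1, last_epoch=36)
```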
That solution works for version 0.4. The problem with 0.4, however, is that I hit an out-of-memory error while training; 0.4 seems to use more VRAM than 0.3.1, and that’s why I’m forced to stick with 0.3.1. Now I need another way to get resuming to work. If it weren’t for that out-of-memory issue, all would be good. Anyway, here is a sample snippet showing what I did that causes the error:
criterion = nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.SGD(model.parameters(), args.lr,
                            momentum=args.momentum,
                            weight_decay=args.weight_decay,
                            nesterov=True)
# epoch milestones
milestones = [30, 60, 90, 130, 150]
scheduler = lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1, last_epoch=36)

# optionally resume from a checkpoint
if args.resume:
    if os.path.isfile(args.resume):
        print_log("=> loading checkpoint '{}'".format(args.resume), log)
        checkpoint = torch.load(args.resume)
        args.start_epoch = checkpoint['epoch']
        best_prec1 = checkpoint['best_prec1']
        if 'best_prec5' in checkpoint:
            best_prec5 = checkpoint['best_prec5']
        else:
            best_prec5 = 0.00
        model.load_state_dict(checkpoint['state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer'])
        model.eval()
        print_log("=> loaded checkpoint '{}' (epoch {})".format(args.resume, checkpoint['epoch']), log)
    else:
        print_log("=> no checkpoint found at '{}'".format(args.resume), log)

cudnn.benchmark = True
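One version-agnostic way to sidestep last_epoch (and the 'initial_lr' requirement) entirely is to construct the scheduler with the default last_epoch=-1 after loading the checkpoint, then fast-forward it by calling step() once per already-completed epoch. A minimal self-contained sketch (the model and optimizer are placeholders; in the real script start_epoch would come from checkpoint['epoch']):

```python
import torch
from torch import nn, optim
from torch.optim import lr_scheduler

# Placeholders for the real model/optimizer; in the actual script
# start_epoch would be read from checkpoint['epoch'].
model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
start_epoch = 36

# Default last_epoch=-1, so no 'initial_lr' entry is required here:
scheduler = lr_scheduler.MultiStepLR(optimizer,
                                     milestones=[30, 60, 90, 130, 150],
                                     gamma=0.1)

# Replay the completed epochs so the learning rate lands where it
# would have been (one gamma decay at milestone 30 in this case):
for _ in range(start_epoch):
    scheduler.step()
```

With these placeholder values, the loop applies the single decay at milestone 30, so the optimizer ends up at 0.1 * 0.1 = 0.01.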
No, I haven’t. I’m in the middle of training (reverted back to 0.3.1 and resumed the training the old-fashioned way).
When the training is finished I’ll update to version 0.4 and give it a try again.