I’m trying to resume training, and I am using torch.optim.lr_scheduler.MultiStepLR to decrease the learning rate. I noticed the constructor accepts a last_epoch parameter, so I tried setting it to the epoch at which my checkpoint was saved and simply resuming training from that point forward.
When I tried to pass a value for this parameter, I got the error:
KeyError: "param 'initial_lr' is not specified in param_groups[0] when resuming an optimizer"
I have no idea what this means or how to get around it. The documentation is also vague to me and I really can’t understand it: it talks about initial_lr, but there is no parameter with that name. I’m completely lost here!
Any help is greatly appreciated.
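For anyone who hits the same thing: the KeyError comes from the scheduler expecting an 'initial_lr' key in every one of the optimizer’s param groups whenever last_epoch is not -1. Below is a minimal sketch of one workaround: seed that key manually before constructing the scheduler (the model, learning rate, and milestones here are placeholders, not the real training setup).

```python
import torch
from torch import nn, optim
from torch.optim import lr_scheduler

# Placeholder model/optimizer standing in for the real training setup.
model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# The scheduler only writes 'initial_lr' into each param group itself
# when last_epoch == -1. When resuming with an explicit last_epoch it
# expects the key to already be present, so seed it with the base
# learning rate before building the scheduler:
for group in optimizer.param_groups:
    group.setdefault('initial_lr', group['lr'])

scheduler = lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90],
                                     gamma=0.1, last_epoch=36)
```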
That solution works for version 0.4. The problem with 0.4, however, is that I hit an out-of-memory error while training; 0.4 seems to use more VRAM than 0.3.1, and that’s why I’m forced to stick with 0.3.1. Now I need another way to get resuming to work. If it weren’t for that out-of-memory issue, all would be good. Anyway, here is a sample snippet showing what I did that causes the error:
criterion = nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.SGD(model.parameters(), args.lr,
                            momentum=args.momentum,
                            weight_decay=args.weight_decay,
                            nesterov=True)
# epoch milestones
milestones = [30, 60, 90, 130, 150]
scheduler = lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1, last_epoch=36)

# optionally resume from a checkpoint
if args.resume:
    if os.path.isfile(args.resume):
        print_log("=> loading checkpoint '{}'".format(args.resume), log)
        checkpoint = torch.load(args.resume)
        args.start_epoch = checkpoint['epoch']
        best_prec1 = checkpoint['best_prec1']
        if 'best_prec5' in checkpoint:
            best_prec5 = checkpoint['best_prec5']
        else:
            best_prec5 = 0.00
        model.load_state_dict(checkpoint['state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer'])
        model.eval()
        print_log("=> loaded checkpoint '{}' (epoch {})".format(args.resume, checkpoint['epoch']), log)
    else:
        print_log("=> no checkpoint found at '{}'".format(args.resume), log)

cudnn.benchmark = True
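One version-agnostic way to sidestep last_epoch (and the 'initial_lr' requirement) entirely is to construct the scheduler with the default last_epoch=-1 after loading the checkpoint, then fast-forward it by calling step() once per already-completed epoch. A minimal self-contained sketch (the model and optimizer are placeholders; in the real script start_epoch would come from checkpoint['epoch']):

```python
import torch
from torch import nn, optim
from torch.optim import lr_scheduler

# Placeholders for the real model/optimizer; in the actual script
# start_epoch would be read from checkpoint['epoch'].
model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
start_epoch = 36

# Default last_epoch=-1, so no 'initial_lr' entry is required here:
scheduler = lr_scheduler.MultiStepLR(optimizer,
                                     milestones=[30, 60, 90, 130, 150],
                                     gamma=0.1)

# Replay the completed epochs so the learning rate lands where it
# would have been (one gamma decay at milestone 30 in this case):
for _ in range(start_epoch):
    scheduler.step()
```

With these placeholder values, the loop applies the single decay at milestone 30, so the optimizer ends up at 0.1 * 0.1 = 0.01.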
No, I haven’t. I’m in the middle of training (reverted back to 0.3.1 and resumed the training the old-fashioned way).
When the training is finished I’ll update to version 0.4 and give it a try again.