Loading the optimizer's state_dict throws an error

I am not sure how models and optimizers work together in PyTorch.

Here is the thing:

I save my model (the full model, not just its state_dict) and I save the optimizer's state_dict:

torch.save({
            'model': model, # it saves the whole model
            'optimizer_state_dict': optimizer.state_dict(),
            'lr_scheduler_state_dict': lr_scheduler.state_dict(),
            }, save_path)

Then I load my model, freeze some layers, define the optimizer again, and load the optimizer's state_dict…

ckpt = torch.load(save_path) 
model = ckpt['model']

for name, param in model.named_parameters():
    if ('layer4' in name) or ('fc' in name):
        param.requires_grad = True 
    else:
        param.requires_grad = False
        
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr = lr)
optimizer.load_state_dict(ckpt['optimizer_state_dict'])

exp_lr_scheduler = lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)
exp_lr_scheduler.load_state_dict(ckpt['lr_scheduler_state_dict'])  # key must match the one used in torch.save

It throws an ugly error:
ValueError: loaded state dict contains a parameter group that doesn’t match the size of optimizer’s group

Can you please help me correct it?

And why do we even need to save the state_dict of the optimizer and the scheduler?

It could be that you filter out some parameters when creating the new optimizer, resulting in a mismatch. Removing the filter should resolve the issue.
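A minimal sketch (using a toy two-layer model, not the poster's actual network) of why the filtering causes that ValueError: the saved state_dict covers every parameter, while the new optimizer only covers the unfrozen subset, so the param-group sizes no longer match.

```python
import torch.nn as nn
import torch.optim as optim

# Toy stand-in for the real model (assumption for illustration only).
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))

# The optimizer that was saved covered ALL 4 parameter tensors.
saved = optim.Adam(model.parameters(), lr=1e-3).state_dict()

# Freeze the first layer, then build the new optimizer over the
# filtered parameters -- it now covers only 2 tensors.
for p in model[0].parameters():
    p.requires_grad = False
filtered_opt = optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)

try:
    filtered_opt.load_state_dict(saved)  # group sizes 4 vs 2 -> error
except ValueError as e:
    print(e)  # "loaded state dict contains a parameter group that ..."
```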

Saving the state_dict of the optimizer and scheduler lets you resume training where you left off.

@liyz15, thanks for replying :slight_smile:

I get it now that the error is because of the parameter mismatch,
but I cannot remove the filter, since I am freezing some layers of the model.

What if, after freezing the layers, I define a new optimizer?
Would it be any different from loading the state_dict of the previously saved optimizer?

I believe setting requires_grad=False should be enough to freeze. See https://stackoverflow.com/questions/53159427/pytorch-freeze-weights-and-update-param-groups

As long as you set requires_grad=False before the forward pass and call optimizer.zero_grad(), the parameter won't be updated.
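A quick sketch of that point, assuming a toy linear layer: a parameter with requires_grad=False never receives a gradient, so the optimizer step leaves it untouched even though it is still registered in the optimizer's param group (no filtering needed).

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(3, 3)                # toy model (assumption)
model.bias.requires_grad = False       # "freeze" the bias
opt = optim.Adam(model.parameters(), lr=0.1)  # no filter: pass ALL params

frozen_before = model.bias.detach().clone()

opt.zero_grad()
model(torch.randn(2, 3)).sum().backward()
opt.step()

# The frozen bias got no gradient, so Adam skipped it entirely.
print(torch.equal(model.bias, frozen_before))  # True
```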

The Adam optimizer adapts the learning rate per parameter, so it has internal state that changes during training. A freshly created optimizer and a trained one are different. Generally, if you want to continue training, load from the state_dict.
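To make "internal state" concrete, here is a sketch with an assumed toy model: after one step, Adam holds per-parameter running moments (exp_avg, exp_avg_sq) and a step count, and a freshly constructed optimizer has none of that until the state_dict is loaded.

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(2, 2)                 # toy model (assumption)
opt = optim.Adam(model.parameters(), lr=1e-3)

model(torch.randn(1, 2)).sum().backward()
opt.step()

# After a step, each parameter has accumulated Adam state.
print(sorted(opt.state_dict()['state'][0].keys()))
# ['exp_avg', 'exp_avg_sq', 'step']

fresh = optim.Adam(model.parameters(), lr=1e-3)
print(len(fresh.state_dict()['state']))   # 0 -- no history yet

fresh.load_state_dict(opt.state_dict())
print(len(fresh.state_dict()['state']))   # 2 -- one entry per parameter
```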

Hi @liyz15
Thanks for the explanation once again! :smiley:
But what do you suggest I do in my case, where I have to freeze layers after some epochs of training?
Do you suggest defining a new optimizer after freezing the layers of the model?

What's the purpose of freezing the layers? If you are trying to finetune on a different dataset, a new optimizer is preferred. If it's a training technique to freeze some layers during training, then continue with the same one.

BTW, saving the entire model directly may not be the best practice; saving model.state_dict() is recommended, see https://github.com/pytorch/pytorch/blob/761d6799beb3afa03657a71776412a2171ee7533/docs/source/notes/serialization.rst

Yes, freezing layers is a training technique. So I train the model for, let's say, 5 epochs on the last 3 layers, then train further for 3 epochs on only the last 2 layers (freezing the third-to-last layer), and so on…

And saving model.state_dict() does not save the requires_grad attribute of the model's parameters, whereas saving the entire model does. Saving the entire model works as long as you are not changing the architecture of the model itself.
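A small sketch of that point (toy layer, assumed shapes): a state_dict stores only tensor values, so a requires_grad flag set before saving is not restored when the state_dict is loaded into a fresh model.

```python
import torch.nn as nn

model = nn.Linear(2, 2)               # toy model (assumption)
model.weight.requires_grad = False    # freeze, then "save"

# load_state_dict copies tensor values only; it does not touch
# the requires_grad attributes of the target model's parameters.
fresh = nn.Linear(2, 2)
fresh.load_state_dict(model.state_dict())

print(fresh.weight.requires_grad)     # True -- the freeze flag was lost
```

So when restoring from a state_dict, the freezing loop has to be re-run after loading.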

So, given that I have to freeze layers after a few epochs, do I have any option other than defining a new optimizer?

Just use the old one; there is no need for filtering. requires_grad=False alone will do the freezing.
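Putting that advice together, here is a sketch of the corrected loading flow (toy model standing in for the real one; `saved` stands in for `ckpt['optimizer_state_dict']`): build the optimizer over ALL parameters so the saved state_dict fits, load it, and then freeze via requires_grad alone.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))  # toy model

# Stand-in for the checkpoint's optimizer state (assumption for the demo).
saved = optim.Adam(model.parameters(), lr=1e-3).state_dict()

# Recreate the optimizer over ALL parameters -- no filter --
# so the param groups match the saved state and loading succeeds.
optimizer = optim.Adam(model.parameters(), lr=1e-3)
optimizer.load_state_dict(saved)

# Now freeze whichever layers you like; frozen params get no
# gradient, so optimizer.step() simply skips them.
for p in model[0].parameters():
    p.requires_grad = False
```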