Error when loading Adam optimizer: resume training

Hi,
I want to resume training, but I get an error when loading my optimizer.
The state was saved like this:

params = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.Adam(params, lr=0.001)
...

torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict()
}, saved_model_path)

I’m trying to load the state like this:

checkpoint = torch.load(MODELPATH, map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])

params = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.Adam(params, lr=0.001)
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
for state in optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.to(device)

epoch_resume = checkpoint['epoch']

But I got this error:

Traceback (most recent call last):
  File "train_keypoints_rcnn.py", line 128, in <module>
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/optimizer.py", line 115, in load_state_dict
    raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group

The model was trained on a Colab GPU, with no LR scheduler.

Is there something wrong in my code?
If you have an idea, please help me solve this error.

Thank you.
Medhy


Hey Vince,

Looks fine to me. One possible cause: you froze some layers in your model (set requires_grad to False) before creating your optimiser and before saving, but forgot to freeze the model again when resuming training.

Perhaps it would help to print the shapes of your params before saving and before loading to see if they match?
for p in params:
    print(p.size())
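
You could also compare the parameter-group sizes directly from the two state dicts; a quick check like this (using the keys from your save code) should make the mismatch visible:

# How many parameters each param group expects, saved vs. freshly built
saved_groups = checkpoint['optimizer_state_dict']['param_groups']
print([len(g['params']) for g in saved_groups])            # counts stored in the checkpoint
print([len(g['params']) for g in optimizer.param_groups])  # counts in the new optimizer

load_state_dict raises exactly this ValueError when those counts disagree.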


Thank you for your response. I thought the problem was only in the optimizer, but in fact it was in my model. I forgot that I had changed the last layer of the network; now everything is working.
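
In case it helps anyone else, the working order looks roughly like this (build_model and replace_last_layer are placeholders for however you construct and modify your network):

# Re-apply the same architecture change *before* loading any state
model = build_model()        # placeholder: the same constructor used for training
replace_last_layer(model)    # placeholder: the same head replacement done before training

checkpoint = torch.load(MODELPATH, map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])

params = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.Adam(params, lr=0.001)
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])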


I save the model (the full model, not just its state_dict) and I save the optimizer's state_dict.

Then I load my model, freeze some layers, define the optimizer again, and load the optimizer's state_dict…

It throws an ugly error:
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group

Can you please help me correct it?

Ok, I was having this error, too. My network is very close to the one at https://github.com/michhar/pytorch-yolo-v3-custom. The main difference between my network and the one in that GitHub repo is my use of the Adadelta optimizer rather than SGD.

Anyway, if you’re “freezing” any part of your network, and your optimizer is only passed “unfrozen” model parameters (i.e. your optimizer filters out model parameters whose requires_grad is False), then when resuming, you’ll need to unfreeze the network again and re-instantiate the optimizer afterwards. See the following code, which loads from a checkpoint:

def load_checkpoint(checkpoint_fpath, model, optimizer):
    # Load the state dicts from file
    checkpoint = torch.load(checkpoint_fpath)
    # Load for model
    model.load_state_dict(checkpoint['state_dict'])

    # Unfreeze the model; the optimizer then has to be re-instantiated
    # (unfreeze_layers and stop_layer are my own helpers, defined elsewhere)
    unfreeze_layers(model, stop_layer)
    optimizer = optim.Adadelta(filter(lambda p: p.requires_grad, model.parameters()),
                               lr=1.0, rho=0.95, eps=1e-08)

    # Load for optimizer
    optimizer.load_state_dict(checkpoint['optimizer'])

    return model, optimizer

And to save:

checkpoint = {
    'state_dict': model.state_dict(),
    'optimizer': optimizer.state_dict()
}
torch.save(checkpoint, "savefilepath.pth")

And you can include the epoch in this checkpoint dictionary very easily by just adding another entry such as 'epoch': currentEpoch. It can then be used in load_checkpoint(); one possible modification is sketched below.
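
For example (a rough sketch, assuming the 'epoch' key was added at save time as described above):

def load_checkpoint(checkpoint_fpath, model, optimizer):
    checkpoint = torch.load(checkpoint_fpath)
    model.load_state_dict(checkpoint['state_dict'])

    # Same unfreeze-then-re-instantiate step as above
    unfreeze_layers(model, stop_layer)
    optimizer = optim.Adadelta(filter(lambda p: p.requires_grad, model.parameters()),
                               lr=1.0, rho=0.95, eps=1e-08)
    optimizer.load_state_dict(checkpoint['optimizer'])

    # Resume from the stored epoch (default to 0 if the key is missing)
    start_epoch = checkpoint.get('epoch', 0)
    return model, optimizer, start_epoch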

TL;DR: Whatever freezing/unfreezing operation you performed last before saving has to be done again before instantiating the optimizer.

It seems obvious now (as the optimizer very evidently filters out frozen parameters), but it took me much longer than I care to admit to figure it out… Hope this helps someone!
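
If you want to see the failure mode in isolation, here is a minimal, self-contained sketch (a toy two-layer model, not the network from this thread):

import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))

# Freeze the first layer, then build the optimizer over trainable params only
for p in model[0].parameters():
    p.requires_grad = False
opt = optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-3)
saved = opt.state_dict()  # its single param group holds 2 tensors (second layer's weight and bias)

# "Resume" without re-applying the freeze: this optimizer's group holds 4 tensors,
# so loading the saved 2-tensor group raises the ValueError from this thread
opt_resume = optim.Adam(model.parameters(), lr=1e-3)
opt_resume.load_state_dict(saved)  # ValueError: loaded state dict contains a parameter group ...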
