Problems with the optimiser when resuming

I have a NN with 4 heads that I want to train one after the other.

After training one head, I save the net. I then want to resume from the checkpoint, change the requires_grad of the already trained head, and start training the next head, but I get this error:
ValueError: loaded state dict contains a parameter group that doesn’t match the size of optimizer’s group

I get the same error even if requires_grad is changed after loading the model.

I do not get any error if I make no change to requires_grad, meaning that resuming itself works fine.

The problem arises when I load the optimiser. The model can be loaded without problems.

Note that the net structure itself never changes over time; it is only the trainability of some layers that is switched.

How can I solve this problem?

Are you filtering the parameters based on their requires_grad attribute before passing them to the optimizer? Also, could you post an executable code snippet to reproduce this issue, as I’m currently unsure what might cause the error?

Hello, first of all thanks for helping.

My code is structured in this way:

  1. I initially create the net structure, with all parameters having requires_grad=True by default.
  2. Depending on information in a config file, I switch requires_grad to False for all heads but one. For instance, the code to switch the heatmap head's requires_grad to False is the following:
if config["trainable_heads"]["heatmap"] == False:
      enc_hm = map(lambda x: x[1],
                                  filter(lambda p: p[1].requires_grad and not ("backbone.decoder_delta" in p[0]),
                                         model.named_parameters()))
      for param in enc_hm:
          param.requires_grad = False 

I have three other identical snippets for the other three heads.
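Just to make the intent clearer, that snippet is roughly equivalent to the following sketch, which freezes every parameter whose name does not contain the head's prefix (the prefix here is the one from the snippet above; the other heads would use their own prefixes):

# roughly equivalent, more direct form of the freezing step above
for name, param in model.named_parameters():
    if "backbone.decoder_delta" not in name:
        param.requires_grad = False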

Then I train and save the net.
When I resume, I get the error, but only if I try to switch the trainable head. If I do not switch heads, resuming works perfectly.

One attempt I made is the following: before resuming from the previous training, I set the layers of the network to the exact same trainability status as just before saving. I checked this by comparing the names of all trainable parameters just before saving and just before resuming. The two lists are indeed identical, as I would have expected, which confuses me a lot.

The error is raised when I load the optimiser:

            self.model.load_state_dict(checkpoint['state_dict'])
            self.scheduler.load_state_dict(checkpoint['scheduler']) 
            self.optimizer.load_state_dict(checkpoint['optimizer'])

No error is raised by the first two calls.
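For context, the checkpoint is assumed to be written roughly like this, using the same keys and attributes that appear in the loading code above:

# sketch of the assumed saving side; the file name is illustrative
checkpoint = {
    'state_dict': self.model.state_dict(),
    'scheduler': self.scheduler.state_dict(),
    'optimizer': self.optimizer.state_dict(),
}
torch.save(checkpoint, 'checkpoint.pt')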

I cannot reproduce the issue using different workflows of freezing different sets of parameters (note that in each case the optimizer is created from the full model.parameters() list):

import torch
import torchvision.models as models

# freeze model.fc, take one optimizer step, then save both state dicts
model = models.resnet18()
for param in model.fc.parameters():
    param.requires_grad = False

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

output = model(torch.randn(1, 3, 224, 224))
output.mean().backward()
optimizer.step()
optimizer.zero_grad()

torch.save(model.state_dict(), 'model.pt')
torch.save(optimizer.state_dict(), 'opt.pt')

# load plain model and optimizer
model = models.resnet18()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
model.load_state_dict(torch.load('model.pt'))
optimizer.load_state_dict(torch.load('opt.pt'))

# load model with same frozen parameters
model = models.resnet18()
for param in model.fc.parameters():
    param.requires_grad = False
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
model.load_state_dict(torch.load('model.pt'))
optimizer.load_state_dict(torch.load('opt.pt'))

# load model with different frozen parameters
model = models.resnet18()
for param in model.conv1.parameters():
    param.requires_grad = False
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
model.load_state_dict(torch.load('model.pt'))
optimizer.load_state_dict(torch.load('opt.pt'))

Thank you. I will go through my code once again. There must be some bug there.

Hello,

From what I see in your code, you always pass all of the network's parameters to the Adam optimizer, regardless of the requires_grad value of each of them.

I thought I had to pass only the trainable ones to the optimiser. That is incorrect, right?

Thanks for helping

You could filter them out beforehand, if you like. As long as the .grad attributes of the frozen parameters aren't filled, the optimizer won't update them, but your filtering approach would be more explicit.
Were you able to reproduce the issue using my code snippet, with the filtering or any other addition to the code?
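For completeness, here is a minimal sketch of the workflow that is assumed to trigger the reported error: the optimizer is built from a filtered (trainable-only) parameter list, and the set of trainable parameters changes between saving and resuming, so the saved parameter group no longer matches the new one:

import torch
import torchvision.models as models

# build the optimizer from only the trainable parameters and save its state
model = models.resnet18()
for param in model.fc.parameters():
    param.requires_grad = False
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)
torch.save(optimizer.state_dict(), 'opt_filtered.pt')

# resume with a different head frozen: the filtered list now contains a
# different number of parameters, so loading the old state raises
# "ValueError: loaded state dict contains a parameter group that doesn't
# match the size of optimizer's group"
model = models.resnet18()
for param in model.conv1.parameters():
    param.requires_grad = False
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)
optimizer.load_state_dict(torch.load('opt_filtered.pt'))

Passing the full model.parameters() list instead, as in the earlier snippets, keeps the parameter group size constant across runs and avoids the mismatch.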