Loading model and optimizer and then fine-tuning only last layer

I am trying to train a model with SGD (lr = 0.001, momentum = 0.9) and save it after training, so that I can then adapt it to a subset of the training speakers by freezing all the weights and training only the last layer. How should I handle loading the optimizer and restricting its parameters to the last layer? I know how to freeze the layers, but then the optimizer no longer matches the loaded state:

    import torch
    import torch.optim as optim

    # Load the model and optimizer state_dicts
    checkpoint = torch.load(modelPath, map_location=device)

    # Apply pretrained model weights
    model.load_state_dict(checkpoint['model_state_dict'])

    # Freeze the weights of the first layers
    for param in model.parameters():
        param.requires_grad = False

    # Enable weight updates in the last layer
    for param in model.output_layer.parameters():
        param.requires_grad = True

    # Define the optimizer over the trainable parameters only
    optimizer = optim.SGD(filter(lambda p: p.requires_grad, model.parameters()), lr=learning_rate, momentum=0.9)
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])  # doesn't match the size of the new optimizer

Hi David!

It seems to me that the simplest solution would be not to load the stored
state of the optimizer. Just create the new optimizer and use it.
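Something along these lines (a minimal sketch, reusing the names from your own snippet, such as modelPath, device, and model.output_layer):

    import torch
    import torch.optim as optim

    # Restore only the model weights from the checkpoint
    checkpoint = torch.load(modelPath, map_location=device)
    model.load_state_dict(checkpoint['model_state_dict'])

    # Freeze everything, then unfreeze the last layer
    for param in model.parameters():
        param.requires_grad = False
    for param in model.output_layer.parameters():
        param.requires_grad = True

    # Fresh optimizer over just the trainable parameters.
    # Note that optimizer.load_state_dict() is simply not called.
    optimizer = optim.SGD(model.output_layer.parameters(), lr=0.001, momentum=0.9)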

It is true that because its momentum is non-zero, the optimizer does carry
non-trivial state: the momenta accumulated from the gradients of prior
(pre-saving) optimization steps. But I don't think that zeroing out those
momenta and then reaccumulating them as you take optimization steps for the
weights in the output_layer that you are fine-tuning is likely to matter much.

(Hypothetically, you could try to dig the final-layer momenta out of the stored
optimizer_state_dict, but I wouldn't bother with it. I'm not aware of any
PyTorch feature that would do this for you at a conveniently high level.)
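(If you really wanted to try, a rough, untested sketch of the manual approach might look like the following. It assumes the saved optimizer was constructed from model.parameters() in the default order, so that its per-parameter state is keyed by each parameter's position in that list, and that optimizer is the new SGD instance built over output_layer's parameters.)

    # Rough sketch only: copy the saved momentum buffers for output_layer's
    # parameters into the new optimizer's state.
    saved_state = checkpoint['optimizer_state_dict']['state']

    # Position of each parameter within model.parameters() (the order the
    # original optimizer presumably saw them in)
    position = {id(p): i for i, p in enumerate(model.parameters())}

    for p in model.output_layer.parameters():
        idx = position[id(p)]
        if idx in saved_state and 'momentum_buffer' in saved_state[idx]:
            # optimizer.state is keyed by the parameter tensor itself
            buf = saved_state[idx]['momentum_buffer']
            optimizer.state[p]['momentum_buffer'] = buf.clone().to(p.device)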

Best.

K. Frank
