Using a pretrained model on another dataset

Hi everyone :slight_smile:

I have a pretrained model that I trained on ImageNet data. I stored the model.state_dict() and the optimizer.state_dict() in a checkpoint file.

Now I would like to train a model on VGGFace2 data with the pretrained weights of the model trained on ImageNet to speed up training. This means that I have to change some of the layers in my model. I have seen that I can add the strict=False flag when I load the model.state_dict() to only add the weights to the layers that are present in both the old and the new model. But there is no such option for the optimizer.state_dict().
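For concreteness, here is a minimal sketch of the partial load I mean (the two model classes are just placeholders, not my actual networks):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real architectures
class ImageNetNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Linear(8, 16)        # shared backbone
        self.classifier = nn.Linear(16, 1000)   # ImageNet head

class FaceNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Linear(8, 16)        # same name/shape: weights transfer
        self.face_head = nn.Linear(16, 8631)    # new head: stays randomly initialised

pretrained = ImageNetNet()
new_model = FaceNet()

# strict=False ignores keys that exist on only one side instead of raising
result = new_model.load_state_dict(pretrained.state_dict(), strict=False)
print(result.missing_keys)     # the new head's parameters
print(result.unexpected_keys)  # the old classifier's parameters
```

So the backbone weights carry over, while the new head keeps its random initialisation.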

How can I resume training with both the model.state_dict() and the optimizer.state_dict() on a different model than the one they were trained on?

Any help is very much appreciated!

All the best,

Have you checked optimizer.load_state_dict()?

I think for most optimizers you don't need to load their state.

Hi @CedricLy, thank you for your response!

What do you mean by optimizer.load_state_dict(pathload[])?

But as discussed here, I think you are right. You only need to load the optimizer state if you want to continue training on the same data with the same network. Otherwise the model.state_dict() is sufficient.

model = TheModelClass(*args, **kwargs)
optimizer = TheOptimizerClass(*args, **kwargs)

checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']

model.eval()
# - or -
model.train()

This is from a tutorial on how to load a checkpoint if you want to continue training.
Check this link for more details.
Does this already solve your issue?

I am familiar with how to resume training and loading the model.state_dict() and the optimizer.state_dict() :slight_smile:

My problem was that I had to load the state_dicts into a different network. For the model.state_dict() you can pass the strict=False flag to tell PyTorch to only load the parameters of layers that are present in both the old and the new network, but there is no such option for the optimizer.state_dict(). However, this apparently isn't a problem, since you don't need the optimizer.state_dict() if you aren't continuing training on the same network.
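Putting that together, a minimal sketch of my intended workflow would look something like this (the nn.Sequential nets are placeholders for the actual architectures, and the checkpoint is kept in memory here instead of on disk):

```python
import io
import torch
import torch.nn as nn

# Placeholder nets: same backbone, different-sized heads
old_model = nn.Sequential(nn.Linear(8, 16), nn.Linear(16, 1000))
new_model = nn.Sequential(nn.Linear(8, 16), nn.Linear(16, 500))

# Simulate the existing ImageNet checkpoint
buffer = io.BytesIO()
torch.save({'model_state_dict': old_model.state_dict()}, buffer)
buffer.seek(0)
checkpoint = torch.load(buffer)

# strict=False only tolerates missing/unexpected keys, not size
# mismatches, so also drop entries whose shapes changed (the head)
new_sd = new_model.state_dict()
filtered = {k: v for k, v in checkpoint['model_state_dict'].items()
            if k in new_sd and v.shape == new_sd[k].shape}
new_model.load_state_dict(filtered, strict=False)

# Fresh optimizer: the old optimizer_state_dict is simply never loaded
optimizer = torch.optim.SGD(new_model.parameters(), lr=0.01, momentum=0.9)
```

The backbone starts from the pretrained weights, the new head starts from random initialisation, and the optimizer starts clean.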

Essentially, my use case does not really fall under 'resume training' but rather 'start training from scratch', just with better initial parameters than random initialisation.

Hope that makes sense!

Ah I see.

So theoretically it would be interesting to see what the "optimizer state" consists of and whether that state depends on the parameters it is trying to train.
If the optimizer state does not depend on the parameters, there will be no issue with loading the saved state alongside different parameters.
If it does depend on them, the question is whether that dependence is kept separate per parameter.
In that case it may be possible to load the optimizer state for only a subset of the parameters.

If the optimizer has something like second-order information saved in its state, it will not be possible. Or at least I would not know how.
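Out of curiosity, you can inspect this directly. Adam, for example, keeps per-parameter running moment estimates keyed by parameter index, so its state really is tied to the exact parameter set it was created for:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# The state is only populated after the first optimisation step
loss = model(torch.randn(1, 4)).sum()
loss.backward()
optimizer.step()

state = optimizer.state_dict()['state']
# One entry per parameter (0: weight, 1: bias), each holding the
# per-parameter buffers 'step', 'exp_avg' and 'exp_avg_sq'
print(sorted(state.keys()))
print(sorted(state[0].keys()))
```

Since each buffer has the same shape as its parameter, the state of a layer that no longer exists simply has nothing to attach to, which supports just starting with a fresh optimizer.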