I have a pretrained model that I trained on ImageNet data. I stored the model.state_dict() and the optimizer.state_dict() in a checkpoint file.
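For context, the checkpoint was created roughly like this (the architecture, hyperparameters and dictionary keys below are only placeholders for what I actually use):

```python
import torch
import torchvision.models as models

# Placeholder model/optimizer for the ImageNet training run.
model = models.resnet50()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Store both state_dicts in a single checkpoint file.
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    "imagenet_checkpoint.pth",
)
```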
Now I would like to train a model on VGGFace2 data, initialised with the pretrained ImageNet weights, to speed up training. This means I have to change some of the layers in my model. I have seen that I can pass the strict=False flag when I load the model.state_dict() so that only the weights of the layers present in both the old and the new model are loaded. But there is no such option for the optimizer.state_dict().
How can I resume training with both the model.state_dict() and optimizer.state_dict() on another model than they were trained on?
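For the model weights, what I have in mind looks roughly like this (architecture, class count and file name are placeholders; note that strict=False only tolerates missing/unexpected keys, not shape mismatches, so entries whose shapes changed are filtered out first):

```python
import torch
import torchvision.models as models

# New model for the VGGFace2 task -- the classifier head has a different size,
# so its weights cannot be copied from the ImageNet checkpoint.
new_model = models.resnet50(num_classes=8631)  # class count is only illustrative

checkpoint = torch.load("imagenet_checkpoint.pth")
pretrained = checkpoint["model_state_dict"]
target = new_model.state_dict()

# Keep only entries that exist in the new model with an unchanged shape.
filtered = {k: v for k, v in pretrained.items()
            if k in target and v.shape == target[k].shape}

missing, unexpected = new_model.load_state_dict(filtered, strict=False)
print("layers left at their random initialisation:", missing)
```

The optimizer part is what I am unsure about.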
What do you mean by optimizer.load_state_dict(pathload[])?
But as discussed here, I think you are right. You only need to load the optimizer state if you want to continue training on the same data with the same network. Otherwise the model.state_dict() is sufficient.
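In that case a minimal sketch for my situation would be to restore only the model weights (as above) and simply create a fresh optimizer for the new model; the optimizer class and hyperparameters here are placeholders:

```python
import torch
import torchvision.models as models

# New VGGFace2 model, assumed to already have the ImageNet weights loaded
# as in the earlier snippet; the class count is only illustrative.
new_model = models.resnet50(num_classes=8631)

# No optimizer state is restored -- a fresh optimizer is built over the
# new model's parameters and training starts from step zero.
optimizer = torch.optim.SGD(new_model.parameters(), lr=0.01, momentum=0.9)
```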
I am familiar with how to resume training by loading the model.state_dict() and the optimizer.state_dict().
My problem was that I had to load the state_dicts into another network. Whereas for the model.state_dict() you can pass the strict=False flag to tell PyTorch to only load the parameters of the layers that are present in both the old and the new network, there is no such option for the optimizer.state_dict(). However, apparently this is not a problem, since you don't need the optimizer.state_dict() if you don't continue training on the same network.
Essentially, my use case does not really fall under the category "resume training" but rather "start training from scratch", just with better initial parameters than random initialisation.
So theoretically it would be interesting to see what the "optimizer state" consists of and whether that state depends on the parameters it is trying to train (a quick way to check this is sketched below).
If the optimizer state does not depend on the parameters, there will be no issue with loading it alongside different parameters.
If it does depend on them, the question is whether that dependence is kept separate for each parameter.
In that case it may be possible to load the optimizer state for only a subset of the parameters.
If the optimizer stores something like second-derivative information in its state, it will not be possible, or at least I would not know how.
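For what it's worth, the optimizer state can be inspected directly. A small sketch with Adam as an example (the tiny model and the single dummy step are only there so the state gets populated):

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One dummy step so the optimizer actually creates its per-parameter state.
loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step()

state = optimizer.state_dict()
print(state.keys())                  # dict_keys(['state', 'param_groups'])
for idx, s in state["state"].items():
    print(idx, list(s.keys()))       # for Adam: ['step', 'exp_avg', 'exp_avg_sq']
    print(idx, s["exp_avg"].shape)   # same shape as the corresponding parameter
```

For Adam, the buffers exp_avg and exp_avg_sq have the same shape as the parameter they belong to, so the state really is tied to the specific parameters the optimizer was constructed with.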