Size mismatch (weights and biases)

Hi,

I have two linear-layer networks (one is a pre-trained model). The only differences between the two networks are the input and output sizes. When I call new_model.load_state_dict(torch.load("model.pth")), I get size mismatch errors.

Solutions I have tried so far, none of which worked:

  1. Passing strict=False when loading the state dict.
  2. Manually changing the parameters after loading the state dict.

I am not sure how it can be solved.
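
For reference, here is a minimal reproduction (layer sizes and names are made up), which also shows why attempt 1 fails: strict=False only ignores missing or unexpected keys, not shape mismatches.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: the two models differ only in their feature sizes.
pretrained = nn.Linear(10, 5)
torch.save(pretrained.state_dict(), "model.pth")

new_model = nn.Linear(12, 7)  # different input and output sizes

# Raises a RuntimeError ("size mismatch for weight ...") even with strict=False,
# since strict=False only relaxes missing/unexpected keys, not tensor shapes.
try:
    new_model.load_state_dict(torch.load("model.pth"), strict=False)
except RuntimeError as e:
    print(e)
```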

Also, what are the consequences (in terms of predictive validity) of testing with a different neural network architecture (such as the one described above)?

Thank you.

In case someone is looking for a solution, I loaded the checkpoint and popped the mismatched keys before loading:

chk = torch.load("pathtofile")
chk.pop('head.weight')
chk.pop('head.bias')
chk.pop('hidden_layer1.weight')
new_model.load_state_dict(chk, strict=False)

(strict=False is needed here because the popped keys are now missing from the state dict.)

This solved the issue above, but I am not sure whether it is a technically valid method since, as far as I know, the weights and biases of higher layers should not be discarded.

Your approach just skips loading the parameters which are causing a shape mismatch and is theoretically valid. Of course, it depends on your use case, as you are now using these parameters with their randomly initialized values and would need to retrain them. However, since you've changed the number of input and output features, you won't be able to directly load the pretrained parameters unless you can come up with a way to "use" the pretrained parameters by reshaping them somehow (e.g. repeating the values, etc.).
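
Purely as an illustration of that last point (all names and shapes below are made up), you could, e.g., tile the pretrained values to the new shape:

```python
import torch
import torch.nn as nn

# Illustration only: "reuse" a pretrained (5, 10) weight in a layer expecting
# (7, 12) by repeating its values and cropping to the new shape.
pretrained_w = torch.randn(5, 10)  # stand-in for the checkpoint tensor
new_layer = nn.Linear(12, 7)

tiled = pretrained_w.repeat(2, 2)[:7, :12]  # (5, 10) -> (10, 20) -> (7, 12)
with torch.no_grad():
    new_layer.weight.copy_(tiled)
```

Whether the repeated values actually transfer anything useful depends on your data and model.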

Thank you for your input. I just want to clarify: are you suggesting something similar to random projection (i.e., reducing the dimensions without losing the underlying characteristics), or is there a simpler approach? Could you please elaborate on your last point (with examples if possible)?
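
For concreteness, this is roughly what I mean by random projection (shapes made up to match the example above):

```python
import torch

# Sketch: map a pretrained (5, 10) weight to a (7, 12) layer via random
# projection matrices (Johnson-Lindenstrauss style). Hypothetical shapes.
pretrained_w = torch.randn(5, 10)
proj_out = torch.randn(7, 5) / 5 ** 0.5    # projects 5 output dims -> 7
proj_in = torch.randn(10, 12) / 10 ** 0.5  # projects 10 input dims -> 12
new_w = proj_out @ pretrained_w @ proj_in  # resulting shape: (7, 12)
```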

I’m not suggesting a specific approach as it depends on your actual use case and why the parameter shapes were changed.
In a common fine-tuning approach you would e.g. just replace the classifier from the pre-trained model and fine-tune it afterwards, which doesn’t seem to fit your use case.
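
For completeness, the common fine-tuning pattern mentioned above looks like this (using torchvision's resnet18 purely as a stand-in for a pre-trained model):

```python
import torch.nn as nn
from torchvision import models

# Keep the pretrained backbone, replace only the classifier head
# so its output matches the new number of classes.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 10)  # e.g. 10 target classes

# Optionally freeze the backbone so only the new head is trained:
for name, param in model.named_parameters():
    if not name.startswith("fc"):
        param.requires_grad = False
```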