load_state_dict: padding and resizing a pre-trained model


I’d like to ask for advice on a functionality I would expect from load_state_dict.

The current implementation fails on a size mismatch or key mismatch. Specifying strict=False will ignore and skip mismatched keys.

My desired functionality is to handle a size mismatch in a way that lets the user still benefit from the pre-trained tensors, despite a structural change in the network:


  • in case the new tensor is larger, copy the overlapping region and randomly initialize (or zero-initialize) the residual.
  • in case the new tensor is smaller, retain the overlapping region and throw away the rest.

Is it possible to do this?

I often find myself with this need: I pre-train a smaller, shallower model until the loss plateaus, then I’d like to carry that experience over to the next model generation (a broader and/or deeper model), saving myself some training time.

Yes, this should be possible by directly manipulating the stored parameters and buffers in the state_dict using your padding/cropping technique. You would then need to experiment with these techniques to see how much the warm start actually helps, and whether training from scratch would yield the same results.
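A minimal sketch of what that manipulation could look like. Note that `adapt_state_dict` and its `init` argument are hypothetical names, not part of torch: for every key shared between the two models, the overlapping region is copied from the pre-trained tensor, and any residual (where the new tensor is larger) is zero- or randomly initialized.

```python
import torch

def adapt_state_dict(model, pretrained_state, init="zero"):
    """Crop/pad pretrained tensors to fit the shapes of `model`.

    Hypothetical helper: copies the overlapping region of each shared
    tensor; the residual of a larger new tensor is zero- or randomly
    initialized. Keys missing from `pretrained_state` keep the model's
    own initialization.
    """
    new_state = model.state_dict()
    for name, new_t in new_state.items():
        if name not in pretrained_state:
            continue  # key mismatch: keep the new model's init
        old_t = pretrained_state[name]
        if old_t.shape == new_t.shape:
            new_state[name] = old_t  # exact match: take it as-is
            continue
        if old_t.dim() != new_t.dim():
            continue  # rank mismatch: nothing sensible to copy
        # start from the residual init, then paste the overlap on top
        base = torch.zeros_like(new_t) if init == "zero" else torch.randn_like(new_t) * 0.01
        overlap = tuple(slice(0, min(o, n)) for o, n in zip(old_t.shape, new_t.shape))
        base[overlap] = old_t[overlap]
        new_state[name] = base
    return new_state

# usage: grow a Linear(4 -> 8) into a Linear(6 -> 8)
small = torch.nn.Linear(4, 8)
big = torch.nn.Linear(6, 8)
big.load_state_dict(adapt_state_dict(big, small.state_dict()))
```

Shrinking works the same way: the `overlap` slices simply crop the pre-trained tensor instead of padding it.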