I am trying to implement a paper in which the authors made some additions to the encoder of a pretrained ViT: they apply a convolution with a learnable filter to the attention maps.
According to them, they used the official implementation (making their changes in this class: https://github.com/jeonsworld/ViT-pytorch/blob/main/models/modeling.py#L56C3-L56C3) and loaded the pretrained image classification checkpoints from HuggingFace, so the untouched layers retain their pretrained weights while the new additions (i.e. the convolution here) have their weights randomly initialized.
How can I do this in PyTorch, and what are the steps? If I take the original ViT repo, make the additions there, and then download the checkpoint from HuggingFace, is it as simple as calling load_state_dict() or something similar, so that it automatically matches the layers I left alone and ignores the new additions? If so, is it just as simple for other problems, for example making some changes to BERT and then loading the pretrained checkpoint from HF to test whether fine-tuning on a downstream task performs better?
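To make the question concrete, here is a minimal sketch of what I have in mind. It uses toy modules rather than the real ViT classes, and the names Block, BlockWithConv and attn_conv are made up for illustration. It also assumes the checkpoint's keys match the parameter names of the modified model, which I have not verified for the HuggingFace checkpoints.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """Toy stand-in for the original (pretrained) encoder block."""

    def __init__(self, dim=8):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)


class BlockWithConv(Block):
    """Same block plus a new learnable filter (hypothetical name: attn_conv)."""

    def __init__(self, dim=8, heads=2):
        super().__init__(dim)
        self.attn_conv = nn.Conv2d(heads, heads, kernel_size=3, padding=1)


torch.manual_seed(0)
pretrained = Block()        # plays the role of the downloaded checkpoint
modified = BlockWithConv()  # the architecture with the addition

# strict=False copies every key that matches by name and reports the rest
# instead of raising an error for keys that are missing from the checkpoint.
result = modified.load_state_dict(pretrained.state_dict(), strict=False)

print("missing (kept at random init):", result.missing_keys)
print("unexpected (in checkpoint only):", result.unexpected_keys)
```

As far as I understand, strict=False only tolerates missing or extra keys. A key that exists in both but with a different shape still raises an error, so this pattern would cover pure additions but not layers whose shapes I changed.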
I found another topic (How to load my pre-trained model into a modified new one model?) that seems to ask a similar question, but I’m unsure if this is the same situation (changes vs. additions, and loading in from an official checkpoint).