Hi,
I am looking for a way to slightly modify Hugging Face GPT-2's architecture by inserting a custom feedforward layer inside a GPT-2 decoder block, right after the masked self-attention sublayer. I then want to initialize all original parameters with the pre-trained GPT-2 weights and the newly added ones randomly. Is there a way to achieve this by inheriting from Hugging Face's GPT-2 classes, instead of copying the modeling_gpt2 file and editing it?
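To make the idea concrete, here is a rough sketch of the kind of thing I have in mind. It wraps each block's attention module rather than subclassing the whole model, and the names `AttnWithExtraFF` and `add_extra_ff` are just mine, not part of the library; the exact placement relative to the residual connection would still need adjusting:

```python
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel

class AttnWithExtraFF(nn.Module):
    """Wraps a GPT2Attention module and applies an extra feedforward
    sublayer (randomly initialized) to its output. Name and layer
    sizes are my own assumptions, not a transformers API."""
    def __init__(self, attn, hidden_size):
        super().__init__()
        self.attn = attn  # the original, pre-trained attention module
        self.extra_ff = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, *args, **kwargs):
        # GPT2Attention returns a tuple whose first element is the
        # attention output; pass the rest (present/attn weights) through.
        outputs = self.attn(*args, **kwargs)
        return (self.extra_ff(outputs[0]),) + tuple(outputs[1:])

def add_extra_ff(model):
    # Swap the attention module of every decoder block for the wrapper.
    for block in model.transformer.h:
        block.attn = AttnWithExtraFF(block.attn, model.config.n_embd)
    return model

# Usage: load pre-trained weights first, then attach the new layers,
# so the original parameters keep their GPT-2 values and the new
# feedforward gets PyTorch's default random init.
# model = add_extra_ff(GPT2LMHeadModel.from_pretrained("gpt2"))
```

Since the wrapper is applied after `from_pretrained`, no weight surgery is needed: everything that existed before keeps its pre-trained values, and only `extra_ff` is random. I'd still like to know whether subclassing `GPT2Block`/`GPT2Model` is the more idiomatic route.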
I’d be really grateful if someone could guide me or point me in the right direction.