Let `src` be the input to the decoder, with `src.size() = (seq_len, embed_dim_1)`. The linear transformation `Q = src * W_query` then yields `Q` with dimensions `(seq_len, embed_dim_2)`.
However, it appears that the PyTorch implementation of the Transformer enforces `embed_dim_1 = embed_dim_2`.
- Why is this?
- Is this the standard variation of the Transformer used in large language models like BERT and GPT?
- Is there a way to allow the dimensions to be different, or do I need to build my own Transformer?
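For context, a minimal sketch of the workaround I am considering (the dimension values here are made up for illustration): wrap a standard `nn.TransformerEncoder` with a `Linear` projection so the model can accept inputs of a different width, since the attention blocks themselves operate at a single `d_model`.

```python
import torch
import torch.nn as nn

embed_dim_1, embed_dim_2, seq_len = 32, 64, 10  # arbitrary example sizes

model = nn.Sequential(
    # Project the input from embed_dim_1 up to embed_dim_2 ...
    nn.Linear(embed_dim_1, embed_dim_2),
    # ... so the encoder can run entirely at d_model = embed_dim_2.
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=embed_dim_2, nhead=8),
        num_layers=2,
    ),
)

src = torch.randn(seq_len, 1, embed_dim_1)  # (seq_len, batch, embed_dim_1)
out = model(src)
print(out.shape)  # torch.Size([10, 1, 64])
```

This keeps `embed_dim_1 = embed_dim_2` inside every attention layer, which is what I understand the PyTorch implementation requires, but I am unsure whether this is the idiomatic approach.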