Hello all,
I am a beginner with the transformer architecture and am trying to implement an image-captioning model. For this, I am using an encoder-decoder transformer with 6 attention heads in both the encoder and decoder blocks. I have a doubt in this regard.
Please tell me whether my understanding is correct: do we need to use the encoder's self-attention output from the 6th head for processing in the decoder's cross-attention, so that the same key and value are used 6 times, each time with a different query (the query being generated by the decoder's self-attention)?
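To make the question concrete, here is a minimal PyTorch sketch of the cross-attention wiring I am asking about. The hyperparameters (embed_dim=384, 6 heads) and the tensor shapes are illustrative assumptions, not my actual model:

```python
# Minimal sketch of decoder cross-attention consuming the encoder output.
import torch
import torch.nn as nn

embed_dim, num_heads = 384, 6  # embed_dim must be divisible by num_heads

batch, src_len, tgt_len = 2, 49, 20
# "memory": the encoder's final self-attention output (all 6 heads are
# projected and concatenated internally by nn.MultiheadAttention).
memory = torch.randn(batch, src_len, embed_dim)          # e.g. a 7x7 image grid, flattened
decoder_hidden = torch.randn(batch, tgt_len, embed_dim)  # decoder self-attention output

cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# Cross-attention: Q comes from the decoder, while K and V both come from the
# same encoder memory. Each of the 6 heads applies its own learned projection
# to this shared memory and runs in parallel with the others, rather than one
# head's output feeding into the next.
out, attn_weights = cross_attn(query=decoder_hidden, key=memory, value=memory)
print(out.shape)  # torch.Size([2, 20, 384])
```

Is this the right way to think about which encoder output the cross-attention should receive?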