Multi-head attention - What are the query, key, and value inputs?

I’m using attention for the first time, and after reading the documentation for nn.MultiheadAttention() I’m not sure what data I’m meant to pass as the query, key, and value arguments.

IMO, this question is barely related to PyTorch: it’s about the transformer architecture, where multi-head attention is an inner component of the encoder and decoder layers.

Anyway, here’s an explanation from an architecture standpoint:

Thanks very much for your response and the link.

While it’s helped my conceptual understanding, I’m still not sure exactly what I should pass as inputs for those arguments. That’s why I posted to the PyTorch forum: it’s the framework I’m using.

If the following is true (as per one of the answers in the link):

Query = I x W(Q)

Key = I x W(K)

Value = I x W(V)

where I is the input (encoder) state vector, and W(Q), W(K), and W(V) are the matrices that transform I into the Query, Key, and Value vectors.

Assuming I is just the input into the attention block, how are W(Q), W(K), and W(V) generated? Or are these linear layers, since the literature describes them as “learnable” matrices?
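To make the "learnable matrices" reading concrete, here is a minimal single-head sketch that treats W(Q), W(K), and W(V) as bias-free nn.Linear layers applied to the same input I. The sizes and the names W_q, W_k, W_v are made up for illustration, and this is not how nn.MultiheadAttention is implemented internally; it only shows the idea from the formulas above.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for this example.
batch, seq_len, embed_dim = 2, 5, 16

# I: the input to the attention block, shape (batch, seq_len, embed_dim).
I = torch.randn(batch, seq_len, embed_dim)

# W(Q), W(K), W(V) as learnable linear layers.
# nn.Linear stores the transpose of the matrix in "I x W"; bias is
# switched off so each layer is a pure matrix multiplication.
W_q = nn.Linear(embed_dim, embed_dim, bias=False)
W_k = nn.Linear(embed_dim, embed_dim, bias=False)
W_v = nn.Linear(embed_dim, embed_dim, bias=False)

# Query = I x W(Q), Key = I x W(K), Value = I x W(V)
Q, K, V = W_q(I), W_k(I), W_v(I)

# Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
scores = Q @ K.transpose(-2, -1) / math.sqrt(embed_dim)
weights = torch.softmax(scores, dim=-1)  # (batch, seq_len, seq_len)
out = weights @ V                        # (batch, seq_len, embed_dim)
```

The three weight matrices start out randomly initialised and are updated by backpropagation like any other layer, which is what "learnable" means here.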

This seems to have answered my question.
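For anyone landing here later, a minimal sketch of the self-attention case with nn.MultiheadAttention: the same tensor is passed as query, key, and value, and the module applies its own learnable projections internally. The sizes are hypothetical, and batch_first=True assumes a reasonably recent PyTorch version. For encoder-decoder (cross) attention you would instead pass the decoder state as query and the encoder output as key and value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes; embed_dim must be divisible by num_heads.
embed_dim, num_heads = 16, 4
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# Input to the attention block, shape (batch, seq_len, embed_dim).
x = torch.randn(2, 5, embed_dim)

# Self-attention: query, key, and value are all the same tensor.
# The module projects each of them with its own learnable weights.
attn_out, attn_weights = mha(x, x, x)

# With the default settings (key and value dims equal to embed_dim),
# the three projection matrices are stored stacked in one parameter.
in_proj_shape = tuple(mha.in_proj_weight.shape)  # (3 * embed_dim, embed_dim)
```

So the Q, K, and V arguments are not the projected vectors from the formulas above; they are the raw inputs that the module projects for you.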

I find the original paper explains why better: