Multiheaded attention - What are the query, key and values?

I’m looking at using attention for the first time, and looking at the documentation for nn.MultiheadAttention() I’m not sure what data I’m meant to be using as input for the Q,K, and V arguments.

IMO, this question is barely related to PyTorch, since it's really about the Transformer architecture, and MultiheadAttention is an inner element of the encoder and decoder layers.

Anyway, here’s an explanation from an architecture standpoint:

Thanks very much for your response and the link.

While it’s helped my conceptual understanding, I’m still not sure exactly what I should be using as inputs for those arguments, which is why I posted on the PyTorch forum, as it’s the framework I’m using.

If the following is true (as per one of the answers in the link):

Query = I x W(Q)

Key = I x W(K)

Value = I x W(V)

where I is the input (encoder) state vector, and W(Q), W(K), and W(V) are the corresponding matrices to transform the I vector into the Query, Key, Value vectors.

Assuming I is just the input into the attention block, how are W(Q), W(K), and W(V) generated? Are they linear layers, given that the literature states they are “learnable” matrices?
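For what it's worth, here is a minimal sketch (not nn.MultiheadAttention's exact internals, and the class/variable names are just made up for illustration) of how those "learnable" matrices are usually realised: each of W(Q), W(K), W(V) is an ordinary nn.Linear layer, trained by backprop like any other parameter.

```python
import torch
import torch.nn as nn

class SingleHeadSelfAttention(nn.Module):
    """Illustrative single-head self-attention; not PyTorch's implementation."""
    def __init__(self, embed_dim):
        super().__init__()
        self.w_q = nn.Linear(embed_dim, embed_dim)  # W(Q), learnable
        self.w_k = nn.Linear(embed_dim, embed_dim)  # W(K), learnable
        self.w_v = nn.Linear(embed_dim, embed_dim)  # W(V), learnable

    def forward(self, x):
        # x: (batch, seq_len, embed_dim) -- plays the role of the input I
        q = self.w_q(x)  # Query = I x W(Q)
        k = self.w_k(x)  # Key   = I x W(K)
        v = self.w_v(x)  # Value = I x W(V)
        # scaled dot-product attention over the projected vectors
        scores = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)
        attn = scores.softmax(dim=-1)
        return attn @ v
```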

This seems to have answered my question.

https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html
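In case it helps anyone else who lands here: for plain self-attention you simply pass the same tensor as query, key, and value, and nn.MultiheadAttention applies its own learnable projections internally. A minimal usage sketch, assuming the default sequence-first layout (seq_len, batch, embed_dim) and arbitrary example sizes:

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8)

x = torch.randn(10, 32, 64)        # (seq_len=10, batch=32, embed_dim=64)
attn_output, attn_weights = mha(x, x, x)   # query = key = value = x

print(attn_output.shape)   # torch.Size([10, 32, 64])
print(attn_weights.shape)  # torch.Size([32, 10, 10]), averaged over heads
```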

I find the original paper explains the “why” better: