I’m using attention for the first time, and I’ve been looking at the documentation for `nn.MultiheadAttention()`.

I’m not sure what data I’m meant to be using as input for the Q, K, and V arguments.

IMO, this question is barely related to PyTorch, since it concerns the Transformer architecture, and MultiheadAttention is an inner element of the Encoder and Decoder layers.

Anyway, here’s an explanation from an architecture standpoint:
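To make the architecture-level idea concrete, here is a minimal sketch of standard scaled dot-product attention (the core operation inside `nn.MultiheadAttention`), written in plain PyTorch. This is an illustrative single-head version with no masking or projections, not the module's actual implementation:

```python
import torch


def scaled_dot_product_attention(q, k, v):
    """Single-head scaled dot-product attention, for illustration only."""
    d_k = q.size(-1)
    # Similarity of each query against every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (..., L_q, L_k)
    # Each query's scores become a distribution over the keys
    weights = torch.softmax(scores, dim=-1)
    # Output is a weighted average of the value vectors
    return weights @ v                               # (..., L_q, d_v)


q = torch.randn(2, 5, 16)  # (batch, query_len, d_k)
k = torch.randn(2, 7, 16)  # (batch, key_len,   d_k)
v = torch.randn(2, 7, 16)  # (batch, key_len,   d_v)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 5, 16])
```

Note that the query length and key/value length can differ; only the key and value lengths must match.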

Thanks very much for your response and the link.

While it’s helped my conceptual understanding, I’m still not sure exactly what I should be using as inputs for those arguments, which is why I posted to the PyTorch forum, as it’s the framework I’m using.

If the following is true (as per one of the answers in the link):

```
Query = I x W(Q)
Key = I x W(K)
Value = I x W(V)
```

where I is the input (encoder) state vector, and W(Q), W(K), and W(V) are the corresponding matrices that transform the I vector into the Query, Key, and Value vectors.

Assuming `I` is just the input into the attention block, how are W(Q), W(K), and W(V) generated? Are these linear layers, given that the literature states they are “learnable” matrices?

This seems to have answered my question.

I find the original paper explains the “why” better: