Why is there a `3 * embed_dim` in the implementation of `nn.MultiheadAttention`?

What does it mean?

```
self.in_proj_weight = Parameter(torch.empty(3 * embed_dim, embed_dim))
```
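For reference, here is a tiny standalone sketch of what I guess that packed weight is for (this is my assumption, not the actual library code; `in_proj_weight` below is just a plain tensor with the same shape as the module's parameter):

```
import torch

embed_dim = 10

# a plain tensor with the same shape as the module's in_proj_weight
in_proj_weight = torch.empty(3 * embed_dim, embed_dim)

# my guess: the single matrix stacks three (embed_dim, embed_dim)
# projection matrices, one each for Q, K and V
w_q, w_k, w_v = in_proj_weight.chunk(3, dim=0)
print(w_q.shape, w_k.shape, w_v.shape)  # each torch.Size([10, 10])
```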

From what I understand, when I do

```
x = nn.MultiheadAttention(embed_dim, number_of_heads)
```

then we have Q, K, and V, which are representations of words in embedding form.
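To make that concrete, this is roughly how I am using it (toy shapes of my own choosing, not from the docs):

```
import torch
import torch.nn as nn

embed_dim, number_of_heads = 10, 2
mha = nn.MultiheadAttention(embed_dim, number_of_heads)

seq_len, batch_size = 7, 1
# query/key/value are word embeddings of size embed_dim
query = torch.randn(seq_len, batch_size, embed_dim)
key = torch.randn(seq_len, batch_size, embed_dim)
value = torch.randn(seq_len, batch_size, embed_dim)

attn_output, attn_weights = mha(query, key, value)
print(attn_output.shape)   # torch.Size([7, 1, 10])
print(attn_weights.shape)  # torch.Size([1, 7, 7])
```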

In the case of MultiheadAttention, we divide this embed_dim by number_of_heads, so if embed_dim is 10 and number_of_heads is 2, the first head would use the first 5 numbers and the second head would use the next 5 numbers.
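Here is a small sketch of what I mean by that split (just my mental model of the per-head slicing, not the library's code):

```
import torch

embed_dim, number_of_heads = 10, 2
head_dim = embed_dim // number_of_heads  # 5

# one vector of size embed_dim for a single word
vec = torch.arange(embed_dim, dtype=torch.float32)

# head 0 sees the first 5 numbers, head 1 sees the next 5
heads = vec.reshape(number_of_heads, head_dim)
print(heads[0])  # tensor([0., 1., 2., 3., 4.])
print(heads[1])  # tensor([5., 6., 7., 8., 9.])
```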

Here, Q is the word I want to get an output embedding for, K represents all the nearby words, and V also represents all the nearby words.

What is the mistake in this explanation?

And why do we use `3 * embed_dim`?