Why is there a `3 * embed_dim` in the implementation of `nn.MultiheadAttention`?

What does it mean?

```
self.in_proj_weight = Parameter(torch.empty(3 * embed_dim, embed_dim))
```
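For reference, here is a tiny standalone sketch of what I guess that packed weight is for (this is my assumption, not the actual library code; `in_proj_weight` below is just a plain tensor with the same shape as the module's parameter):

```
import torch

embed_dim = 10

# a plain tensor with the same shape as the module's in_proj_weight
in_proj_weight = torch.empty(3 * embed_dim, embed_dim)

# my guess: the single matrix stacks three (embed_dim, embed_dim)
# projection matrices, one each for Q, K and V
w_q, w_k, w_v = in_proj_weight.chunk(3, dim=0)
print(w_q.shape, w_k.shape, w_v.shape)  # each torch.Size([10, 10])
```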

From what I understand, when I do

```
x = nn.MultiheadAttention(embed_dim, number_of_heads)
```

then we have Q, K, and V, which are representations of words in embedding form.
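To make that concrete, this is roughly how I am using it (toy shapes of my own choosing, not from the docs):

```
import torch
import torch.nn as nn

embed_dim, number_of_heads = 10, 2
mha = nn.MultiheadAttention(embed_dim, number_of_heads)

seq_len, batch_size = 7, 1
# query/key/value are word embeddings of size embed_dim
query = torch.randn(seq_len, batch_size, embed_dim)
key = torch.randn(seq_len, batch_size, embed_dim)
value = torch.randn(seq_len, batch_size, embed_dim)

attn_output, attn_weights = mha(query, key, value)
print(attn_output.shape)   # torch.Size([7, 1, 10])
print(attn_weights.shape)  # torch.Size([1, 7, 7])
```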

In the case of MultiheadAttention, we divide this embed_dim by number_of_heads, so if embed_dim is 10 and number_of_heads is 2, the first head would use the first 5 numbers and the second head would use the next 5 numbers.
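Here is a small sketch of what I mean by that split (just my mental model of the per-head slicing, not the library's code):

```
import torch

embed_dim, number_of_heads = 10, 2
head_dim = embed_dim // number_of_heads  # 5

# one vector of size embed_dim for a single word
vec = torch.arange(embed_dim, dtype=torch.float32)

# head 0 sees the first 5 numbers, head 1 sees the next 5
heads = vec.reshape(number_of_heads, head_dim)
print(heads[0])  # tensor([0., 1., 2., 3., 4.])
print(heads[1])  # tensor([5., 6., 7., 8., 9.])
```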

Here, Q is the word I want to get an output embedding for, K represents all the nearby words, and V also represents all the nearby words.

What is the mistake in this explanation?

And why do we use `3 * embed_dim`?