Why is the W_q matrix in nn.MultiheadAttention square?

aktsvigun · August 30, 2020, 12:58pm

Good afternoon,
I am trying to use nn.MultiheadAttention in my network. According to the docs,

embed_dim – total dimension of the model.

However, according to the source file,

embed_dim must be divisible by num_heads

and

self.q_proj_weight = Parameter(torch.Tensor(embed_dim, embed_dim))

If I understand correctly, this means each head takes only a part of the features of each query, since the matrix is square (embed_dim × embed_dim) rather than one full-size projection per head. Is this a bug in the implementation, or is my understanding wrong? Thanks in advance!
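One way to check this reading is a small shape experiment. The sketch below is not the library's code: `w_q` is a random stand-in for `q_proj_weight`, the sizes are made up, and it assumes the projected query is split into heads along its last dimension, each of size `head_dim = embed_dim // num_heads`. Under those assumptions, each head receives a slice of the projected features, but every slice is still computed from all `embed_dim` input features.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
embed_dim, num_heads, seq_len = 8, 2, 5
head_dim = embed_dim // num_heads  # embed_dim must be divisible by num_heads

# Random stand-in for q_proj_weight: one square matrix shared by all heads.
w_q = torch.randn(embed_dim, embed_dim)
x = torch.randn(seq_len, embed_dim)  # seq_len query vectors

# Project with the full square matrix (same convention as a linear layer:
# x @ W^T), then split the output features into heads.
q = x @ w_q.T                                   # (seq_len, embed_dim)
q_heads = q.view(seq_len, num_heads, head_dim)  # (seq_len, num_heads, head_dim)

# Equivalent view: head h owns rows [h*head_dim, (h+1)*head_dim) of w_q,
# i.e. its own (head_dim, embed_dim) projection that reads every input feature.
per_head = [
    x @ w_q[h * head_dim:(h + 1) * head_dim].T
    for h in range(num_heads)
]

heads_match = all(
    torch.allclose(q_heads[:, h], per_head[h], atol=1e-5)
    for h in range(num_heads)
)
print(q_heads.shape, heads_match)
```

If the two computations agree, the square matrix can be read as `num_heads` separate (head_dim × embed_dim) projections stacked along the rows, so the split happens on the output side of the projection, not on the input features.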

aktsvigun · August 31, 2020, 6:56am

@ptrblck could you please have a look?