Why is the W_q matrix in nn.MultiheadAttention square?

Good afternoon,
I am trying to implement nn.MultiheadAttention in my network. According to the docs,

embed_dim – total dimension of the model.

However, according to the source file,

embed_dim must be divisible by num_heads


self.q_proj_weight = Parameter(torch.Tensor(embed_dim, embed_dim))

If I understand correctly, this means each head only takes a part of the features of each query, since the matrix is square (embed_dim x embed_dim) rather than one full-size projection per head. Is this a bug in the implementation, or is my understanding wrong? Thanks in advance!
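To make the question concrete, here is a minimal sketch of the shapes involved. It assumes (I have not verified this against the source) that the square projection is applied to the full query first and the result is then split across heads. The sizes and variable names are made up for illustration and are not taken from the PyTorch code.

```python
import torch

torch.manual_seed(0)

embed_dim, num_heads = 8, 2            # illustrative sizes only
head_dim = embed_dim // num_heads      # needs embed_dim % num_heads == 0
seq_len = 5                            # hypothetical number of query tokens

# Stand-in for q_proj_weight: a square (embed_dim, embed_dim) matrix.
w_q = torch.randn(embed_dim, embed_dim)
query = torch.randn(seq_len, embed_dim)

# Assumed order of operations: project the full query, then split by head.
q = query @ w_q.T                                  # (seq_len, embed_dim)
q_heads = q.view(seq_len, num_heads, head_dim)     # (seq_len, num_heads, head_dim)

# Head 0 corresponds to the first head_dim rows of w_q. Each of those rows
# spans all embed_dim input features, so the head reads the whole query.
head0 = query @ w_q[:head_dim].T                   # (seq_len, head_dim)
same = torch.allclose(q_heads[:, 0], head0)

print(q_heads.shape, head0.shape, same)
```

If that assumption holds, each head would see all embed_dim input features and only its output would be head_dim wide, which would also explain why embed_dim must be divisible by num_heads.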

@ptrblck could you please have a look?