Question 1: k_dim and v_dim are the dimensions of your key and value inputs (the same roles as key and value in self-attention).
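Here is a minimal sketch, assuming PyTorch's nn.MultiheadAttention, where these parameters are named kdim and vdim (the tensor shapes below are just illustrative):

import torch
import torch.nn as nn

# Query has embed_dim = 300; key/value come from a different source
# with their own feature sizes (kdim = 200, vdim = 100).
mha = nn.MultiheadAttention(embed_dim=300, num_heads=2, kdim=200, vdim=100)

query = torch.rand(10, 32, 300)  # (seq_len, batch, embed_dim)
key = torch.rand(20, 32, 200)    # (seq_len, batch, kdim)
value = torch.rand(20, 32, 100)  # (seq_len, batch, vdim)

out, attn_weights = mha(query, key, value)
print(out.shape)  # torch.Size([10, 32, 300])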
Question 2: It must be divisible because embedding_dim is divided across the different heads. So if your embedding_dim = 300 and you have num_heads = 2, the first head works on 150 dimensions of the embedding and the second head works on the other 150; the results of the two heads are later concatenated. A sketch of this split follows below.
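A rough sketch of the split arithmetic (hypothetical tensor names, just to illustrate why the division must be exact):

import torch

embedding_dim, num_heads = 300, 2
head_dim = embedding_dim // num_heads  # 150; must divide evenly

x = torch.rand(32, 10, embedding_dim)        # (batch, seq_len, embedding_dim)
heads = x.view(32, 10, num_heads, head_dim)  # split: each head sees 150 dims
heads = heads.transpose(1, 2)                # (batch, num_heads, seq_len, head_dim)

# ... each head attends independently on its 150-dim slice ...

merged = heads.transpose(1, 2).reshape(32, 10, embedding_dim)  # concatenate heads back
print(merged.shape)  # torch.Size([32, 10, 300])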
Please see the torch.nn.MultiheadAttention documentation. It can clear up all your doubts.