I don’t understand how the `nn.MultiheadAttention` module works.
What is the meaning of `kdim` and `vdim` in `__init__`? Are they related to the `key` and `value` parameters of `forward`?
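For context, here is how I am constructing and calling the module; the concrete dimensions are made up, and `kdim`/`vdim` here are the `__init__` parameters I am asking about:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 16, 4
kdim, vdim = 10, 12  # key/value feature sizes that differ from embed_dim

mha = nn.MultiheadAttention(embed_dim, num_heads,
                            kdim=kdim, vdim=vdim, batch_first=True)

q = torch.randn(2, 5, embed_dim)  # (batch, target_len, embed_dim)
k = torch.randn(2, 7, kdim)       # (batch, source_len, kdim)
v = torch.randn(2, 7, vdim)       # (batch, source_len, vdim)

out, weights = mha(q, k, v)
print(out.shape)      # torch.Size([2, 5, 16])
print(weights.shape)  # torch.Size([2, 5, 7])
```

So the output always comes back in `embed_dim`, no matter what `kdim`/`vdim` are.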
Why must `embedding_dim` be divisible by `num_heads`? What is the meaning of the `head_dim` derived from that formulation (`head_dim = embedding_dim // num_heads`)?
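To make the constraint concrete, this is the check as I understand it (a plain-Python sketch of my understanding, not PyTorch’s actual source):

```python
embed_dim, num_heads = 16, 4

# each head is supposed to work in a smaller subspace of this size
head_dim = embed_dim // num_heads          # 16 // 4 = 4

# this only holds when embed_dim % num_heads == 0, hence the divisibility requirement
assert head_dim * num_heads == embed_dim

print(head_dim)  # 4
```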
In my understanding, multi-head self-attention works this way:

- the query, key and value of `embedding_dim` are projected into `key_dim`, `key_dim` and `value_dim` respectively;
- after computing attention weights between query and key, the values of `value_dim` are collected according to the attention weights of each query, which forms exactly the output of the current attention head;
- finally, the outputs of all heads are concatenated into one matrix of `num_heads * value_dim` and projected back into `embedding_dim`.

Thus, I don’t see where the module should divide `embedding_dim` by `num_heads`.
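To be precise about my mental model, here is a minimal sketch of the scheme I described above (random weight matrices instead of learned ones; not PyTorch’s actual implementation; here each head’s `head_dim` happens to be `embed_dim // num_heads`, which is where the constraint would come from):

```python
import torch

def multi_head_self_attention(x, num_heads):
    """Sketch: project, attend per head, collect values, concat, project back."""
    batch, seq_len, embed_dim = x.shape
    head_dim = embed_dim // num_heads

    # stand-ins for learned projection matrices
    wq = torch.randn(embed_dim, embed_dim)
    wk = torch.randn(embed_dim, embed_dim)
    wv = torch.randn(embed_dim, embed_dim)
    wo = torch.randn(embed_dim, embed_dim)

    # project, then split the last dim into (num_heads, head_dim)
    q = (x @ wq).view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
    k = (x @ wk).view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
    v = (x @ wv).view(batch, seq_len, num_heads, head_dim).transpose(1, 2)

    # attention weights between query and key, per head
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
    weights = scores.softmax(dim=-1)

    # collect values by the weights, concatenate heads, project back
    heads = weights @ v                                   # (batch, heads, seq, head_dim)
    concat = heads.transpose(1, 2).reshape(batch, seq_len, embed_dim)
    return concat @ wo

out = multi_head_self_attention(torch.randn(2, 5, 16), num_heads=4)
print(out.shape)  # torch.Size([2, 5, 16])
```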
- What is the difference between