Questions about `torch.nn.MultiheadAttention`

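Question 1: `kdim` and `vdim` are the feature dimensions of the key and value inputs. If you leave them unset they default to `embed_dim`, which is the usual case in self-attention, where the query, key and value are the same tensor.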
Question 2: It must be divisible because `embed_dim` is split evenly across the heads, so each head gets `embed_dim // num_heads` features. If your `embed_dim` = 300 and you have `num_heads` = 2, the first head works on one 150-dimensional part of the (projected) embedding and the second head works on the other 150. The results of the two heads are later concatenated back into a 300-dimensional output.
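Here is a small sketch that shows both points in practice. The key/value sizes (64 and 32), the batch size and the sequence lengths are made up for illustration, and the `try` block assumes that PyTorch rejects a non-divisible `embed_dim` at construction time (in the versions I have seen this is an assertion).

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 300, 2
head_dim = embed_dim // num_heads  # 150 features per head

# kdim / vdim describe the feature size of the key and value inputs.
# 64 and 32 are arbitrary illustrative values; both default to embed_dim.
mha = nn.MultiheadAttention(embed_dim, num_heads, kdim=64, vdim=32, batch_first=True)

query = torch.randn(4, 10, embed_dim)  # (batch, target_len, embed_dim)
key = torch.randn(4, 7, 64)            # (batch, source_len, kdim)
value = torch.randn(4, 7, 32)          # (batch, source_len, vdim)

out, weights = mha(query, key, value)
out_shape = tuple(out.shape)          # output always has embed_dim features
weights_shape = tuple(weights.shape)  # attention weights, averaged over heads

# 300 is not divisible by 7, so the heads cannot be given equal slices.
try:
    nn.MultiheadAttention(embed_dim=300, num_heads=7)
    divisibility_error = None
except (AssertionError, ValueError) as err:
    divisibility_error = str(err)

print(head_dim, out_shape, weights_shape, divisibility_error)
```

Note that the output keeps `embed_dim` features no matter what `kdim` and `vdim` are, because the key and value are projected before the heads are split.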

