Cannot assign different embed_dim and kdim in nn.MultiheadAttention

Hi everyone!

I am trying to build a multi-head attention model that has different values for d_k, d_v, and d_model (where d_k, d_v, and d_model are as in https://arxiv.org/pdf/1706.03762.pdf)

I do

attention = nn.MultiheadAttention(embed_dim=200, num_heads=20, kdim=300, vdim=300, batch_first=True)

and then with

input = torch.randn((10, 4, 300))
output, output_weights = attention(input, input, input)

I get an error

assert embed_dim == embed_dim_to_check, \
AssertionError: was expecting embedding dimension of 200, but got 300
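For what it’s worth, the assertion seems to compare the query’s last dimension against embed_dim; the same module does appear to accept 300-dimensional keys and values as long as the query itself has 200 features, e.g.

query = torch.randn((10, 4, 200))        # last dim must match embed_dim
key = value = torch.randn((10, 4, 300))  # last dim must match kdim / vdim
output, output_weights = attention(query, key, value)
print(output.shape)  # torch.Size([10, 4, 200])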

I realise that, mathematically, we could just set embed_dim (a.k.a. d_model) to 300 and then add a linear projection, i.e. multiply the output by a matrix W_O’ of dimensions 300 x 200, which would be mathematically equivalent to what we need. But I don’t understand why the above assertion is needed here.
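For reference, a minimal sketch of that workaround (assuming the goal is a 200-dimensional output computed from 300-dimensional inputs): keep embed_dim at 300 so the assertion passes, and add an extra linear layer playing the role of W_O’ above.

import torch
import torch.nn as nn

attention = nn.MultiheadAttention(embed_dim=300, num_heads=20, batch_first=True)
extra_out_proj = nn.Linear(300, 200)  # the extra 300 x 200 projection W_O’

x = torch.randn((10, 4, 300))
out, out_weights = attention(x, x, x)  # out: (10, 4, 300)
out = extra_out_proj(out)             # out: (10, 4, 200)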

This is somewhat similar to Not possible to use different key/value dimensionalities in nn.MultiheadAttention · Issue #27623 · pytorch/pytorch · GitHub; however, in our case it is d_model and d_k that need to differ, not d_v and d_k.

What am I missing? Thanks in advance


I’ll second this - the assertions around embed_dim are too restrictive. embed_dim seems to be a “catch-all” parameter, even though multi-head attention fundamentally allows four different embedding dimensions to coexist (see the sketch after this list):

  1. The input dimension of the query - doesn’t need to be divisible by num_heads,
  2. The embed_dim for query and key - this should be divisible by num_heads,
  3. The embed_dim for value as well as the input dimension of the out_projection - this should be divisible by num_heads,
  4. The output dimension of out_proj - doesn’t need to be divisible by num_heads.
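As a sketch of what that could look like, here is a minimal hand-rolled multi-head attention in which the four dimensions above stay independent. The parameter names (q_in_dim, kv_in_dim, qk_dim, v_dim, out_dim) are made up for illustration and are not the torch API:

import torch
import torch.nn as nn

class FlexibleMHA(nn.Module):
    # q_in_dim : input dimension of the query                       (1.)
    # kv_in_dim: input dimension of the key/value (torch's kdim/vdim)
    # qk_dim   : projected query/key dimension                      (2., divisible by num_heads)
    # v_dim    : projected value dimension                          (3., divisible by num_heads)
    # out_dim  : output dimension of the final projection           (4.)
    def __init__(self, q_in_dim, kv_in_dim, qk_dim, v_dim, out_dim, num_heads):
        super().__init__()
        assert qk_dim % num_heads == 0 and v_dim % num_heads == 0
        self.num_heads = num_heads
        self.q_proj = nn.Linear(q_in_dim, qk_dim)
        self.k_proj = nn.Linear(kv_in_dim, qk_dim)
        self.v_proj = nn.Linear(kv_in_dim, v_dim)
        self.out_proj = nn.Linear(v_dim, out_dim)

    def forward(self, query, key, value):
        # query: (B, Lq, q_in_dim), key/value: (B, Lkv, kv_in_dim)
        B, Lq, _ = query.shape
        Lkv, h = key.shape[1], self.num_heads
        q = self.q_proj(query).view(B, Lq, h, -1).transpose(1, 2)   # (B, h, Lq, qk_dim/h)
        k = self.k_proj(key).view(B, Lkv, h, -1).transpose(1, 2)    # (B, h, Lkv, qk_dim/h)
        v = self.v_proj(value).view(B, Lkv, h, -1).transpose(1, 2)  # (B, h, Lkv, v_dim/h)
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)     # (B, h, Lq, Lkv)
        out = scores.softmax(dim=-1) @ v                            # (B, h, Lq, v_dim/h)
        out = out.transpose(1, 2).reshape(B, Lq, -1)                # (B, Lq, v_dim)
        return self.out_proj(out)                                   # (B, Lq, out_dim)

# Example matching the original post: 300-dim inputs, 20 heads, 200-dim output.
mha = FlexibleMHA(q_in_dim=300, kv_in_dim=300, qk_dim=200, v_dim=200,
                  out_dim=200, num_heads=20)
x = torch.randn(10, 4, 300)
print(mha(x, x, x).shape)  # torch.Size([10, 4, 200])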