Hi everyone!

I am trying to build a multi-head attention model that has different values for d_k, d_v, and d_model (where d_k, d_v, and d_model are as in https://arxiv.org/pdf/1706.03762.pdf)

I do

`attention = nn.MultiheadAttention(embed_dim=200, num_heads=20, kdim=300, vdim=300, batch_first=True)`

and then with

`input = torch.randn((10, 4, 300))`
`output, output_weights = attention(input, input, input)`

I get an error

`assert embed_dim == embed_dim_to_check, \`
`AssertionError: was expecting embedding dimension of 200, but got 300`

I realise that, mathematically, we could just set embed_dim (AKA d_model) to 300 and then add a linear projection, i.e. multiply the output by a matrix W_O’ of dimensions 300 x 200, which would be equivalent to what we need. What I don’t understand is why the above assertion is needed here.
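For concreteness, here is a minimal sketch of the workaround I have in mind. The names `project`, `x`, and `projected` are just mine for illustration, and I am assuming a bias-free `nn.Linear` is an acceptable stand-in for the extra matrix W_O’:

```python
import torch
import torch.nn as nn

# embed_dim matches the 300-dim input, so the embedding-dimension check passes.
# 300 is divisible by num_heads=20, giving 15 dimensions per head.
attention = nn.MultiheadAttention(embed_dim=300, num_heads=20, batch_first=True)

# Hypothetical extra projection playing the role of W_O' (300 x 200).
project = nn.Linear(300, 200, bias=False)

x = torch.randn(10, 4, 300)  # (batch, sequence, features)
output, output_weights = attention(x, x, x)
projected = project(output)

print(output.shape)          # torch.Size([10, 4, 300])
print(projected.shape)       # torch.Size([10, 4, 200])
print(output_weights.shape)  # torch.Size([10, 4, 4]), averaged over heads
```

This runs without the assertion, but it stacks a second linear layer on top of the module's own output projection rather than changing its shape.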

This is somewhat similar to the GitHub issue "Not possible to use different key/value dimensionalities in nn.MultiheadAttention" (pytorch/pytorch, Issue #27623); however, we need d_model and d_k to be different, not d_v and d_k.

What am I missing? Thanks in advance