Number of parameters of MultiheadAttention

Hi everyone,

While doing some tests, I noticed that increasing or decreasing the number of heads in multi-head attention does not change the total number of learnable parameters of my model.
Is this behavior correct? And if so, why?

Shouldn’t the number of heads affect the number of parameters the model can learn?

Thanks! 🙂

EDIT: obviously I'm using the PyTorch implementation, nn.MultiheadAttention.


If I understand the code correctly, the embed_dim is split across the heads, so changing the number of heads does not change the parameter count.
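A quick sketch to verify this (assuming PyTorch is installed): the parameter count of nn.MultiheadAttention should stay constant as num_heads varies, since the Q/K/V and output projections are always embed_dim × embed_dim matrices.

```python
import torch.nn as nn

def count_params(module):
    # Sum of all learnable parameter elements in the module
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

embed_dim = 512
for num_heads in (1, 2, 4, 8):
    mha = nn.MultiheadAttention(embed_dim, num_heads)
    # in_proj: (3*embed_dim, embed_dim) weight + (3*embed_dim,) bias
    # out_proj: (embed_dim, embed_dim) weight + (embed_dim,) bias
    print(num_heads, count_params(mha))
```

For embed_dim = 512, every row should print the same count: 3·512·512 + 3·512 + 512·512 + 512 = 1,050,624.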


Yes, exactly, though it still leaves me a little perplexed ...
For example, in Transformers it seems to me that changing the number of heads does change the total number of trainable parameters.
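To see why the count is fixed in PyTorch: nn.MultiheadAttention requires embed_dim to be divisible by num_heads and uses head_dim = embed_dim // num_heads per head, so the combined Q/K/V projection weight always has shape (3·embed_dim, embed_dim) regardless of the split. A minimal check (assuming the default case where query, key, and value share the same embed_dim, so the fused in_proj_weight attribute exists):

```python
import torch.nn as nn

embed_dim = 512
for num_heads in (1, 8):
    mha = nn.MultiheadAttention(embed_dim, num_heads)
    # The fused Q/K/V projection and the output projection have
    # shapes that depend only on embed_dim, not on num_heads
    print(num_heads,
          tuple(mha.in_proj_weight.shape),
          tuple(mha.out_proj.weight.shape))
```

Implementations that instead fix the per-head dimension (so total projection width = num_heads · head_dim) would grow the parameter count with the number of heads, which may explain the different behavior seen elsewhere.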