Doing some tests, I noticed that increasing or decreasing the number of heads in multi-head attention does not change the total number of learnable parameters of my model.
Is this behavior correct? And if so, why?
Shouldn’t the number of heads affect the number of parameters the model can learn?
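This is actually the expected behavior in the standard formulation (the one used by e.g. `nn.MultiheadAttention`): the Q, K, V and output projections are each `embed_dim × embed_dim` matrices, and the number of heads only controls how `embed_dim` is *split* into chunks of size `embed_dim // num_heads`. A minimal sketch of the parameter arithmetic, assuming that standard layout with biases:

```python
def mha_param_count(embed_dim: int, num_heads: int) -> int:
    """Count learnable parameters in a standard multi-head attention block.

    Q, K, V and the output projection are each an (embed_dim x embed_dim)
    weight plus an embed_dim bias. The per-head dimension is
    embed_dim // num_heads, so num_heads only changes how the projections
    are split across heads, not their total size.
    """
    assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
    weights = 4 * embed_dim * embed_dim  # Q, K, V, and output projections
    biases = 4 * embed_dim
    return weights + biases

# The count is identical regardless of the head split:
print(mha_param_count(512, 4))   # 1050624
print(mha_param_count(512, 8))   # 1050624
print(mha_param_count(512, 16))  # 1050624
```

So more heads means more, smaller attention subspaces over the same projection matrices, which changes what the model can express but not how many parameters it has.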
Yes, exactly, even if it leaves me a little perplexed…
For example, in the transformers library, it seems to me that changing the number of heads does change the total number of trainable parameters.