While doing some tests, I noticed that increasing or decreasing the number of heads in the multi-head attention does not change the total number of learnable parameters of my model.
Is this behavior correct? And if so, why?
Shouldn’t the number of heads affect the number of parameters the model can learn?
EDIT: to be clear, I'm using the PyTorch implementation, nn.MultiheadAttention.
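For reference, a minimal sketch of the kind of check that shows this behavior (the `embed_dim` value here is just an example of mine, not from my actual model). PyTorch splits `embed_dim` across the heads (`head_dim = embed_dim // num_heads`), so the Q/K/V and output projection matrices are `embed_dim × embed_dim` regardless of `num_heads`:

```python
import torch.nn as nn

# Count learnable parameters of nn.MultiheadAttention for several head counts.
# The projections are always embed_dim x embed_dim, so the count is identical:
# in_proj_weight (3*E*E) + in_proj_bias (3*E) + out_proj (E*E + E).
embed_dim = 64
for num_heads in (1, 2, 4, 8):
    mha = nn.MultiheadAttention(embed_dim, num_heads)
    n_params = sum(p.numel() for p in mha.parameters())
    print(f"num_heads={num_heads}: {n_params} parameters")
```

Every head count prints the same total, which is exactly the behavior I'm asking about.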