I recently noticed that increasing the number of heads in a MultiheadAttention layer doesn't increase the layer's parameter count. Why is that? Since each head is essentially running self-attention on the input and computing its own set of attention weights, shouldn't adding heads add parameters?
Edit: to print the per-parameter sizes I'm using
[(name, param.numel()) for name, param in model.named_parameters()]
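
For anyone who wants to reproduce what I'm seeing, here's a minimal sketch. I'm assuming PyTorch's nn.MultiheadAttention (the named_parameters() call above is PyTorch API); embed_dim = 512 and the head counts are just illustrative values that divide it evenly:

import torch.nn as nn

embed_dim = 512

# Same embed_dim, different numbers of heads; each head count must
# divide embed_dim evenly, as nn.MultiheadAttention requires.
for num_heads in (1, 2, 4, 8):
    mha = nn.MultiheadAttention(embed_dim, num_heads)
    total = sum(p.numel() for p in mha.parameters())
    print(f"num_heads={num_heads}: {total} parameters")

Every head count prints the same total (1,050,624 for embed_dim = 512, i.e. the input/output projection weights and biases), which is exactly the behavior I'm asking about.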