I recently noticed that increasing the number of heads for a MultiHeadAttention layer doesn’t increase the number of parameters the layer uses. Why is that? Since each head is basically running self-attention on the input and computing its own set of attention weights, shouldn’t increasing the number of heads increase the parameter count?
^edit: I’m using
`[(name, layer.numel()) for name, layer in model.named_parameters()]` to print the parameter counts
This is consistent with how MultiHeadAttention is implemented in the original paper (“Attention Is All You Need”). Each head acts on a reduced dimensionality of the original problem (embed_dim / num_heads per head), so the total computational cost (and parameter count) stays comparable to that of a single head operating at full dimensionality.
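You can verify this directly with PyTorch’s `nn.MultiheadAttention`: since `embed_dim` is split across heads, the projection matrices keep the same total size no matter how many heads you use. A minimal check (the `param_count` helper is just for illustration):

```python
import torch.nn as nn

def param_count(module):
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())

# The Q/K/V and output projections are always embed_dim x embed_dim in
# total, so the count is identical for every head setting.
for heads in (1, 2, 4, 8):
    mha = nn.MultiheadAttention(embed_dim=64, num_heads=heads)
    print(heads, param_count(mha))
```

Every head count prints the same number, confirming the observation in the question.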
There’s another way of performing MultiHeadAttention where the parameters (and thus computation) scale with the number of heads. It basically mimics running several of PyTorch’s MultiHeadAttention layers, each with one head, in parallel. You’ll need to implement this yourself or find an existing implementation.