I’ll try the intuition part…
You can think of the heads as a panel of people: each head is a different person, with its own thoughts and view of the situation (the head's weights).
So each person gives their own output, and then there is a leader who takes all the panel's outputs into account and gives the final verdict. That leader is the final linear (output projection) layer of the multi-head block: the outputs of all the heads are concatenated and fed to a linear layer to produce the final output.
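If it helps, here is a rough sketch of the panel idea in plain Python, just for intuition. All the names (`attention_head`, `multi_head_attention`, `w_o` and so on) are my own, the weights are random, and I'm assuming plain self-attention with no masking, biases or batching, so don't read it as how any particular library does it.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    """Random weights, standing in for trained ones (assumption for the demo)."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def attention_head(x, w_q, w_k, w_v):
    """One panel member: scaled dot-product self-attention with its own weights."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)

def multi_head_attention(x, heads, w_o):
    """Ask every head, concatenate their outputs per token, then let the leader (w_o) decide."""
    outputs = [attention_head(x, *h) for h in heads]
    concat = [sum((out[i] for out in outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)

rng = random.Random(0)
seq_len, d_model, num_heads = 3, 8, 2
head_dim = d_model // num_heads

x = rand_matrix(seq_len, d_model, rng)  # 3 tokens, 8 features each
heads = [tuple(rand_matrix(d_model, head_dim, rng) for _ in range(3))
         for _ in range(num_heads)]     # (w_q, w_k, w_v) per head
w_o = rand_matrix(num_heads * head_dim, d_model, rng)  # the leader

y = multi_head_attention(x, heads, w_o)
print(len(y), len(y[0]))  # 3 8
```

Each head only sees the input through its own `w_q`, `w_k`, `w_v`, and the only place where the heads' opinions get mixed is the final multiplication by `w_o`.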
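Adding more heads adds more parameters if each head keeps a fixed size. In the usual Transformer setup, where the model dimension is split evenly across the heads, the total number of parameters stays the same.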
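To make the parameter question concrete, here is a quick count under two assumptions: either the model dimension is split across the heads, or every head has a fixed size of its own. The function `attention_params` is a hypothetical helper of mine, and it counts weights only (no biases).

```python
def attention_params(d_model, num_heads, head_dim=None):
    """Count weights (no biases) in the Q, K, V projections plus the output projection."""
    if head_dim is None:
        # Assumption: the model dimension is split evenly across the heads.
        head_dim = d_model // num_heads
    qkv = 3 * d_model * head_dim * num_heads  # one w_q, w_k, w_v per head
    out = num_heads * head_dim * d_model      # the final linear layer
    return qkv + out

for h in (1, 2, 4, 8):
    split = attention_params(64, h)               # head size shrinks as h grows
    fixed = attention_params(64, h, head_dim=16)  # head size stays at 16
    print(h, split, fixed)
```

With the split, the count is 16384 for every number of heads here; with a fixed head size of 16 it grows linearly, from 4096 for one head to 32768 for eight.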
As a side note, more heads does not mean a better model; the number of heads is a hyperparameter, and the best value depends on the task.
Roy.