I have just started learning the Transformer architecture. I assumed that increasing the number of heads would increase the number of Q/K/V matrices, and with them the number of learnable parameters. Why does the following code print the same parameter count for both head settings?
    from torch import nn

    def count_parameters(model):
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    print(count_parameters(nn.Transformer(nhead=8)))  # 44140544
    print(count_parameters(nn.Transformer(nhead=1)))  # 44140544
This is by design. The original Transformer paper ("Attention Is All You Need") explains:

"In this work we employ h = 8 parallel attention layers, or heads. For each of these we use d_k = d_v = d_model/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality."

Each head therefore projects to only d_model/h dimensions, so stacking h such heads gives the same total projection size, and the same number of parameters, as a single head with the full d_model.
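You can check this directly on a single attention module. The sketch below is not from the original question; it simply counts the parameters of nn.MultiheadAttention for two head counts:

    import torch.nn as nn

    def count_parameters(model):
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    # The in-projection is a single (3*embed_dim, embed_dim) weight plus one
    # output projection, so the total does not depend on num_heads.
    for nhead in (1, 8):
        mha = nn.MultiheadAttention(embed_dim=512, num_heads=nhead)
        print(nhead, count_parameters(mha))  # same count for both head settings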
It doesn’t have to be done this way, but that’s how nn.Transformer is implemented anyway.
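Concretely, the heads come from reshaping the output of one full-width projection. Here is a rough standalone sketch of that splitting step (the variable names are mine, not the library's):

    import torch
    from torch import nn

    batch, seq, d_model, nhead = 2, 5, 512, 8
    x = torch.randn(batch, seq, d_model)

    q_proj = nn.Linear(d_model, d_model)  # one weight matrix, whatever nhead is
    q = q_proj(x)                         # (batch, seq, d_model)

    # Splitting into heads is just a view/transpose; it adds no parameters.
    q = q.view(batch, seq, nhead, d_model // nhead).transpose(1, 2)
    print(q.shape)  # torch.Size([2, 8, 5, 64])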
For learning purposes, I’ve implemented my own Transformer architecture from scratch, and there I could easily keep the per-head dimension equal to model_size instead of dividing it by the number of heads.
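For example, a hypothetical variant in which every head gets its own full-width Q/K/V projection would scale its parameter count with the number of heads (this is my own illustration, not how nn.Transformer works):

    from torch import nn

    class FullWidthHeads(nn.Module):
        # Hypothetical design: each head projects to the full d_model.
        def __init__(self, d_model, nhead):
            super().__init__()
            self.qkv = nn.ModuleList(
                [nn.Linear(d_model, 3 * d_model) for _ in range(nhead)]
            )

    def count_parameters(model):
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    print(count_parameters(FullWidthHeads(512, 1)))  # 787968
    print(count_parameters(FullWidthHeads(512, 8)))  # 8x as many: 6303744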
EDIT: Yes, with nn.Transformer you would need to multiply your intended model_size by the number of heads if you want each head to keep the full width.
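A sketch of that workaround, assuming the width you originally wanted per head was 512:

    from torch import nn

    def count_parameters(model):
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    # d_model must stay divisible by nhead; here every head works at width 512.
    print(count_parameters(nn.Transformer(d_model=512 * 8, nhead=8)))
    print(count_parameters(nn.Transformer(d_model=512, nhead=8)))  # far fewer parameters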