I have a conceptual question.
BERT-base uses a dimension of 768 for the query, key, and value projections, with 12 attention heads (hidden dimension = 768, number of heads = 12). The same is reflected in the BERT-base architecture:
(self): BertSelfAttention(
  (query): Linear(in_features=768, out_features=768, bias=True)
  (key): Linear(in_features=768, out_features=768, bias=True)
  (value): Linear(in_features=768, out_features=768, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
)
Now, my question is:
Can I consider the first 64 neurons of the out_features as the first head, the next 64 neurons as the second head, and so on? (Sec. 3.2.2 of the original paper; Link)
Basically, I am wondering whether the Linear module representing the query matrix, which is 768×768, can be thought of as twelve (768×64) matrices stacked side by side. The same question applies to the key and value modules.
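To make the question concrete, here is a minimal sketch of what I mean (assuming the split follows the reshape that Hugging Face's `BertSelfAttention.transpose_for_scores` performs; the tensor shapes and sizes below are just BERT-base's 768/12 configuration, not taken from any specific checkpoint):

```python
import torch
import torch.nn as nn

hidden_size, num_heads = 768, 12
head_dim = hidden_size // num_heads  # 64

# Same shape as BERT-base's (query) module above
query = nn.Linear(hidden_size, hidden_size, bias=True)

x = torch.randn(1, 10, hidden_size)  # (batch, seq_len, hidden)
q = query(x)                         # (1, 10, 768)

# Split the 768 output features into 12 heads of 64 each
q_heads = q.view(1, 10, num_heads, head_dim).permute(0, 2, 1, 3)  # (1, 12, 10, 64)

# Head 0 is exactly output features [0:64], head 1 is [64:128], etc.
assert torch.equal(q_heads[0, 0], q[0, :, :head_dim])
assert torch.equal(q_heads[0, 1], q[0, :, head_dim:2 * head_dim])

# Equivalently, rows [0:64] of query.weight (stored as out_features x in_features)
# form head 0's 64x768 projection, i.e. the transpose of one 768x64 block
head0_proj = query.weight[:head_dim, :]  # (64, 768)
```

Is this slicing interpretation correct?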
If so, could someone provide some starter code, as I am unable to wrap my head around it? Any help is appreciated (and I have a sample in the contribution section).
P.S.: Here's the issue I raised on GitHub (link)