Does embed dimension need to be divisible by the number of heads in MultiheadAttention just because of parallel work?

laro · December 19, 2021, 5:28am

When using nn.Transformer, the size of d_model must be divisible by nhead.

What is the reason for this restriction?
Is it because of parallel work? [Is the sequence split along the emb_dim dimension?]

my3bikaht · December 19, 2021, 8:53am

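Yes. In MHA the projected queries, keys, and values are split along the hidden dimension into nhead equal chunks, one per head. That hidden dimension is d_model because the input projections are d_model → d_model linear layers, so d_model must be divisible by nhead.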
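To make the split concrete, here is a minimal sketch of what happens to the projected queries. The sizes (d_model=512, nhead=8) and the variable names are just illustrative assumptions, and this mirrors the idea rather than the exact internals of nn.MultiheadAttention:

```python
import torch
import torch.nn as nn

d_model, nhead = 512, 8          # illustrative sizes, not from the thread
head_dim = d_model // nhead      # 64 features per head

batch, seq_len = 2, 10
x = torch.randn(batch, seq_len, d_model)

# d_model -> d_model projection, so the hidden dimension stays d_model
q_proj = nn.Linear(d_model, d_model)
q = q_proj(x)                    # (batch, seq_len, d_model)

# Split the hidden dimension into nhead chunks of head_dim each.
# This reshape only works when nhead * head_dim == d_model.
q_heads = q.view(batch, seq_len, nhead, head_dim).transpose(1, 2)
# q_heads: (batch, nhead, seq_len, head_dim)

# A non-divisible pair is rejected when the module is constructed.
# The exact exception type may differ between PyTorch versions.
try:
    nn.MultiheadAttention(embed_dim=512, num_heads=7)
    rejected = False
except (AssertionError, ValueError):
    rejected = True
```

So the requirement comes from the reshape: the sequence itself is not split, only the feature dimension of each token.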

But now I wonder if having d_model as hidden dimension is too restrictive?
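As a rough illustration of what a less restrictive design could look like, the sketch below projects to nhead * head_dim instead of d_model, so d_model no longer has to be divisible by nhead. The class DecoupledSelfAttention and its parameters are hypothetical and are not part of PyTorch; masking and dropout are left out for brevity:

```python
import torch
import torch.nn as nn


class DecoupledSelfAttention(nn.Module):
    """Hypothetical self-attention where head_dim is chosen independently of d_model."""

    def __init__(self, d_model, nhead, head_dim):
        super().__init__()
        self.nhead = nhead
        self.head_dim = head_dim
        inner = nhead * head_dim              # need not equal d_model
        self.qkv = nn.Linear(d_model, 3 * inner)
        self.out = nn.Linear(inner, d_model)  # map back to d_model

    def forward(self, x):
        b, t, _ = x.shape
        qkv = self.qkv(x).view(b, t, 3, self.nhead, self.head_dim)
        # Move the q/k/v axis first; each of q, k, v is (b, nhead, t, head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        ctx = scores.softmax(dim=-1) @ v      # (b, nhead, t, head_dim)
        ctx = ctx.transpose(1, 2).reshape(b, t, self.nhead * self.head_dim)
        return self.out(ctx)


# d_model=100 is not divisible by nhead=7, yet this still works
attn = DecoupledSelfAttention(d_model=100, nhead=7, head_dim=16)
y = attn(torch.randn(2, 10, 100))
```

The trade-off is that the parameter count and compute now depend on nhead * head_dim rather than on d_model alone.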