Does embed dimension need to be divisible by num of heads in MultiheadAttention just because of parallel work?

When using nn.Transformer, the size of `embed_dim` must be divisible by `num_heads`.
  1. What is the reason for this restriction?
  2. Is the reason parallel work? [Is the sequence split along the emb_dim dimension?]

Yes: the inputs to MHA are split along the hidden dimension, which is d_model because of the d_model → d_model input projection. Each of the num_heads heads then attends over its own slice of size d_model / num_heads, so d_model must divide evenly by num_heads.
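The shape bookkeeping can be sketched in plain Python (the shapes below follow `torch.nn.MultiheadAttention`'s internal reshape; the function name is illustrative, not part of any API):

```python
# Sketch of why embed_dim must be divisible by num_heads: after the
# in-projection, the embedding dimension is cut into num_heads equal
# slices of size head_dim, one per attention head.
def split_heads(x_shape, embed_dim, num_heads):
    # x_shape: (seq_len, batch, embed_dim) after the in-projection
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    head_dim = embed_dim // num_heads
    seq_len, batch, _ = x_shape
    # each head attends over its own head_dim-sized slice
    return (batch * num_heads, seq_len, head_dim)

print(split_heads((10, 2, 512), 512, 8))  # (16, 10, 64)
```

With `embed_dim=512` and `num_heads=8`, each head gets a 64-dimensional slice; with `num_heads=7` the reshape is impossible, which is exactly the restriction PyTorch enforces.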

But now I wonder whether having d_model as the hidden dimension is too restrictive?
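It is not a fundamental restriction, only PyTorch's design choice (head_dim is fixed to d_model // num_heads). A hypothetical sketch in NumPy where the per-head dimension is a free parameter, so num_heads * head_dim need not equal d_model (all names here are illustrative assumptions, not PyTorch's API):

```python
import numpy as np

# Hypothetical attention where head_dim is decoupled from d_model:
# project d_model -> num_heads * head_dim, attend per head, project back.
def attention_free_head_dim(x, num_heads, head_dim, rng):
    seq_len, d_model = x.shape
    inner = num_heads * head_dim          # need NOT equal d_model
    wq = rng.standard_normal((d_model, inner)) / np.sqrt(d_model)
    wk = rng.standard_normal((d_model, inner)) / np.sqrt(d_model)
    wv = rng.standard_normal((d_model, inner)) / np.sqrt(d_model)
    wo = rng.standard_normal((inner, d_model)) / np.sqrt(inner)

    # project, then split into heads: (num_heads, seq_len, head_dim)
    def heads(w):
        return (x @ w).reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = heads(wq), heads(wk), heads(wv)
    # scaled dot-product attention per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    # concat heads and project back to d_model
    out = (weights @ v).transpose(1, 0, 2).reshape(seq_len, inner)
    return out @ wo

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 12))          # d_model = 12
y = attention_free_head_dim(x, num_heads=3, head_dim=7, rng=rng)
print(y.shape)  # (5, 12) even though 3 * 7 = 21 != 12
```

So the divisibility requirement is a consequence of reusing the d_model → d_model projection and slicing it, which keeps the parameter count and output dimension fixed; a separate output projection removes the constraint at the cost of an extra hyperparameter.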