Does embed dimension need to be divisible by num of heads in MultiheadAttention just because of parallel work?

When using nn.Transformer, the size of `embed_dim` must be divisible by `num_heads`.
  1. What is the reason for this restriction?
  2. Is the reason parallel work? [Is the sequence split along the emb_dim dimension?]

Yes: the inputs to MHA are split along the hidden dimension, which is d_model because of the d_model → d_model input projection. Each of the num_heads heads then attends over its own slice of size d_model / num_heads, so d_model must divide evenly by num_heads.
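The shape bookkeeping can be sketched in plain Python (the shapes below follow `torch.nn.MultiheadAttention`'s internal reshape; the function name is illustrative, not part of any API):

```python
# Sketch of why embed_dim must be divisible by num_heads: after the
# in-projection, the embedding dimension is cut into num_heads equal
# slices of size head_dim, one per attention head.
def split_heads(x_shape, embed_dim, num_heads):
    # x_shape: (seq_len, batch, embed_dim) after the in-projection
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    head_dim = embed_dim // num_heads
    seq_len, batch, _ = x_shape
    # each head attends over its own head_dim-sized slice
    return (batch * num_heads, seq_len, head_dim)

print(split_heads((10, 2, 512), 512, 8))  # (16, 10, 64)
```

With `embed_dim=512` and `num_heads=8`, each head gets a 64-dimensional slice; with `num_heads=7` the reshape is impossible, which is exactly the restriction PyTorch enforces.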

But now I wonder whether having d_model as the hidden dimension is too restrictive?
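It is not a fundamental restriction, only PyTorch's design choice (head_dim is fixed to d_model // num_heads). A hypothetical sketch in NumPy where the per-head dimension is a free parameter, so num_heads * head_dim need not equal d_model (all names here are illustrative assumptions, not PyTorch's API):

```python
import numpy as np

# Hypothetical attention where head_dim is decoupled from d_model:
# project d_model -> num_heads * head_dim, attend per head, project back.
def attention_free_head_dim(x, num_heads, head_dim, rng):
    seq_len, d_model = x.shape
    inner = num_heads * head_dim          # need NOT equal d_model
    wq = rng.standard_normal((d_model, inner)) / np.sqrt(d_model)
    wk = rng.standard_normal((d_model, inner)) / np.sqrt(d_model)
    wv = rng.standard_normal((d_model, inner)) / np.sqrt(d_model)
    wo = rng.standard_normal((inner, d_model)) / np.sqrt(inner)

    # project, then split into heads: (num_heads, seq_len, head_dim)
    def heads(w):
        return (x @ w).reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = heads(wq), heads(wk), heads(wv)
    # scaled dot-product attention per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    # concat heads and project back to d_model
    out = (weights @ v).transpose(1, 0, 2).reshape(seq_len, inner)
    return out @ wo

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 12))          # d_model = 12
y = attention_free_head_dim(x, num_heads=3, head_dim=7, rng=rng)
print(y.shape)  # (5, 12) even though 3 * 7 = 21 != 12
```

So the divisibility requirement is a consequence of reusing the d_model → d_model projection and slicing it, which keeps the parameter count and output dimension fixed; a separate output projection removes the constraint at the cost of an extra hyperparameter.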