When using `nn.Transformer`, the size of `d_model` must be divisible by `nhead`.

- What is the reason for this restriction?
- Is the reason parallelism? [Will the sequence be split along the `emb_dim` dimension?]
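For concreteness, a minimal repro of the restriction (a sketch; assumes a standard PyTorch install, and the exact error text may vary across versions):

```python
import torch.nn as nn

# d_model = 512, nhead = 8: 512 % 8 == 0, so this constructs fine.
ok = nn.Transformer(d_model=512, nhead=8)

# d_model = 512, nhead = 7: 512 % 7 != 0, so PyTorch rejects it
# with an error like "embed_dim must be divisible by num_heads".
bad = nn.Transformer(d_model=512, nhead=7)
```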
Yes, inputs to MHA are split along the hidden dimension, which is `d_model` due to the `d_model → d_model` linear projection: each head operates on a slice of size `d_model / nhead`, which is why `d_model` must be divisible by `nhead`.
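A minimal sketch of that split (shapes only; `head_dim` mirrors the name PyTorch uses internally, and the reshape order matches the usual MHA layout):

```python
import torch

d_model, nhead = 512, 8
head_dim = d_model // nhead  # 64; must be an exact integer

batch, seq_len = 2, 10
# Output of the d_model -> d_model projection (e.g., the query projection).
q = torch.randn(batch, seq_len, d_model)

# MHA views the hidden dimension as (nhead, head_dim) and moves the head
# axis next to the batch axis, so attention runs independently per head.
q_heads = q.view(batch, seq_len, nhead, head_dim).transpose(1, 2)
print(q_heads.shape)  # torch.Size([2, 8, 10, 64])
```

Since `head_dim = d_model // nhead` has to come out as a whole number for this reshape to work, the divisibility requirement falls directly out of the head split.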
But now I wonder: is having `d_model` as the hidden dimension too restrictive?