I don’t understand how the `nn.MultiheadAttention` module works.
What is the meaning of `kdim` and `vdim` in `__init__`? Are they related to the `key` and `value` parameters of `forward`?
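For context, here is how I am constructing and calling the module; the concrete dimensions are made up, and `kdim`/`vdim` here are the `__init__` parameters I am asking about:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 16, 4
kdim, vdim = 10, 12  # key/value feature sizes that differ from embed_dim

mha = nn.MultiheadAttention(embed_dim, num_heads,
                            kdim=kdim, vdim=vdim, batch_first=True)

q = torch.randn(2, 5, embed_dim)  # (batch, target_len, embed_dim)
k = torch.randn(2, 7, kdim)       # (batch, source_len, kdim)
v = torch.randn(2, 7, vdim)       # (batch, source_len, vdim)

out, weights = mha(q, k, v)
print(out.shape)      # torch.Size([2, 5, 16])
print(weights.shape)  # torch.Size([2, 5, 7])
```

So the output always comes back in `embed_dim`, no matter what `kdim`/`vdim` are.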
Why must `embedding_dim` be divisible by `num_heads`? What is the meaning of the `head_dim` derived from that formulation (`head_dim = embedding_dim // num_heads`)?
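To make the constraint concrete, this is the check as I understand it (a plain-Python sketch of my understanding, not PyTorch’s actual source):

```python
embed_dim, num_heads = 16, 4

# each head is supposed to work in a smaller subspace of this size
head_dim = embed_dim // num_heads          # 16 // 4 = 4

# this only holds when embed_dim % num_heads == 0, hence the divisibility requirement
assert head_dim * num_heads == embed_dim

print(head_dim)  # 4
```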
In my understanding, multi-head self-attention works this way:

- the query, key and value of `embedding_dim` are projected into `key_dim`, `key_dim` and `value_dim` respectively;
- after computing attention weights between query and key, the values of `value_dim` are collected according to the attention weights of each query, which forms exactly the output of the current attention head;
- finally, the outputs of all heads are concatenated into one matrix of `num_heads * value_dim` and projected back into `embedding_dim`.

Thus, I don’t see where the module should divide `embedding_dim` by `num_heads`.
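To be precise about my mental model, here is a minimal sketch of the scheme I described above (random weight matrices instead of learned ones; not PyTorch’s actual implementation; here each head’s `head_dim` happens to be `embed_dim // num_heads`, which is where the constraint would come from):

```python
import torch

def multi_head_self_attention(x, num_heads):
    """Sketch: project, attend per head, collect values, concat, project back."""
    batch, seq_len, embed_dim = x.shape
    head_dim = embed_dim // num_heads

    # stand-ins for learned projection matrices
    wq = torch.randn(embed_dim, embed_dim)
    wk = torch.randn(embed_dim, embed_dim)
    wv = torch.randn(embed_dim, embed_dim)
    wo = torch.randn(embed_dim, embed_dim)

    # project, then split the last dim into (num_heads, head_dim)
    q = (x @ wq).view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
    k = (x @ wk).view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
    v = (x @ wv).view(batch, seq_len, num_heads, head_dim).transpose(1, 2)

    # attention weights between query and key, per head
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
    weights = scores.softmax(dim=-1)

    # collect values by the weights, concatenate heads, project back
    heads = weights @ v                                   # (batch, heads, seq, head_dim)
    concat = heads.transpose(1, 2).reshape(batch, seq_len, embed_dim)
    return concat @ wo

out = multi_head_self_attention(torch.randn(2, 5, 16), num_heads=4)
print(out.shape)  # torch.Size([2, 5, 16])
```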
- What is the difference between