As mentioned in Attention Is All You Need, we should apply softmax function on result of (QK/sqrt(dk)) to achieve weights or attention score for each sequence element (like words). It’s unclear for me why we need to apply softmax on columns of feature vectors? I mean, according to PyTorch implementation of
multi_head_attention_forward softmax is applied with dim=-1. From what I understood, we need to get attention score for each word using softmax, so we should apply softmax on rows to multiply it by Value Embedding of each word. Why is this not happening?
For example in FEED-FORWARD NETWORKS WITH ATTENTION CAN SOVLE SOME LONG-TERM MEMORY PROBLEMS, researchers proposed attention mechanism which apply softmax on hidden states. in other words, softmax is applying on rows of result matrix (each row assumed to be hidden state output).
I appreciate for any guidance.