Attention layer and softmax

Hello, I want to implement an attention layer which consists of Query, Key and Value.
As I understand it, I should perform a matrix multiplication between Query and Key, then
normalize the result by the square root of its dimension, and then pass it to a softmax.
My question is about the softmax: it is not clear to me over which dimension of the result of the multiplication between Query and Key I should apply the softmax.
Dimension of Query: [batch, depth, x, y]
Dimension of Key: [batch, depth, x, y]
Query * transpose(Key) → Normalization → softmax
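
Here is a minimal sketch of what I am trying in PyTorch (the tensor sizes are just placeholders, and applying the softmax over the last dimension is my own assumption, which I would like to confirm):

```python
import torch
import torch.nn.functional as F

# placeholder sizes, just for illustration
batch, depth, x, y = 2, 8, 16, 32

query = torch.randn(batch, depth, x, y)
key = torch.randn(batch, depth, x, y)

# Query @ transpose(Key) over the last two dims:
# [batch, depth, x, y] @ [batch, depth, y, x] -> [batch, depth, x, x]
scores = torch.matmul(query, key.transpose(-2, -1))

# normalize by sqrt of the feature dimension (here y)
scores = scores / (y ** 0.5)

# my assumption: softmax over the last dimension,
# i.e. over the Key positions, so each row of attention weights sums to 1
attn = F.softmax(scores, dim=-1)
```

Is `dim=-1` the correct dimension for the softmax here, or should it be applied over a different axis?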