Why is

Q * K.t()

(where .t() means transpose) used in attention, and not, say,

Q * (Q + K).t()?
For example, suppose we have two pixels, black and white, and we want to represent each ordered combination of them differently. Map each pixel to its own vector:

black -> Q
white -> K

so the four combinations are:

black white -> (Q, K)
white black -> (K, Q)
black black -> (Q, Q)
white white -> (K, K)
Then

Q * K.t()

gives the same score for

black white

and

white black

whereas with

Q * (Q + K).t()

all four would be different. Other options could be:
Q * (Q - K).t()

but then

black black
white white

would both score zero, i.e. be the same. Or

Q * K * K

but that would be computationally more expensive than

Q * (Q + K).t()

Or just

(Q + K)

but then

black white
white black

would be the same. Or

(Q - K)

but then

white white
black black

would be the same. Or only

Q

or only

K

but then combinations sharing the same pixel in that position could not be distinguished. Or we could concatenate Q and K, but the increased size would make the subsequent operations more expensive.
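The toy comparison above can be checked numerically. A minimal sketch, assuming scalar stand-ins for the two pixel embeddings (the values 2.0 and 3.0 are my own arbitrary choice, purely for illustration; in real attention Q and K are matrices produced by learned projections):

```python
# Toy scalar "embeddings": black -> Q, white -> K (arbitrary assumed values)
black, white = 2.0, 3.0

def qk(q, k):          # Q * K.t()
    return q * k

def q_q_plus_k(q, k):  # Q * (Q + K).t()
    return q * (q + k)

def q_q_minus_k(q, k): # Q * (Q - K).t()
    return q * (q - k)

def q_plus_k(q, k):    # (Q + K)
    return q + k

def q_minus_k(q, k):   # (Q - K)
    return q - k

# The four ordered combinations: (first pixel, second pixel)
pairs = {
    "black white": (black, white),
    "white black": (white, black),
    "black black": (black, black),
    "white white": (white, white),
}

for name, fn in [("Q*K.t()", qk), ("Q*(Q+K).t()", q_q_plus_k),
                 ("Q*(Q-K).t()", q_q_minus_k), ("(Q+K)", q_plus_k),
                 ("(Q-K)", q_minus_k)]:
    print(name, {p: fn(q, k) for p, (q, k) in pairs.items()})

# Q*K.t() cannot tell "black white" from "white black"
assert qk(black, white) == qk(white, black)
# Q*(Q+K).t() separates all four combinations (for these values)
assert len({q_q_plus_k(q, k) for q, k in pairs.values()}) == 4
# Q*(Q-K).t() collapses "black black" and "white white" (both zero)
assert q_q_minus_k(black, black) == q_q_minus_k(white, white) == 0.0
```

Note this sketch follows the simplification in the question, where each pixel has a single vector; in actual attention each token gets a separate query and key from different learned projections of the same embedding, which already breaks the symmetry argued about above.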