Attention Q*(Q+K).t()

Why is

Q*(K).t()

(t() means transpose) done in attention, and not, for example,

Q*(Q+K).t() ?

Suppose we have two pixels, black and white, and we want to represent each combination of them differently:

black -> (Q)           white -> (K)

black white -> (QK)    white black -> (KQ)    black black -> (QQ)    white white -> (KK)
Q*(K).t()

will give the same result for

black white

and

white black

whereas if we do,

Q*(Q+K).t()

then all four would be different. Other options could be

Q*(Q-K)

but then

black black
white white

would be the same, or

Q*K*K

but that would be more computationally expensive than

Q*(Q+K)

or

(Q+K)

but then,

black white
white black

would be the same,

or

(Q-K)

but then,

white white
black black

would be the same,

or only

Q

or only

K

but then all four would be the same,

or concatenate Q and K together, but that would require more computation to carry out this operation again, since the size is increased (a quick numeric check of these cases is sketched below).
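To make this concrete, here is a quick numeric check of the cases above (a toy sketch: q stands for the black pixel's vector and k for the white pixel's vector, as in the diagram, and the actual numbers are made up by me):

import torch

q = torch.tensor([1.0, 2.0])   # vector for the black pixel (toy values)
k = torch.tensor([3.0, 0.0])   # vector for the white pixel (toy values)

def score(a, b):
    return torch.dot(a, b)     # dot-product score for one (query, key) pair

# standard Q*K.t(): symmetric, so black-white and white-black collide
print(score(q, k), score(k, q))            # 3.0, 3.0

# Q*(Q+K).t(): all four combinations differ in general
print(score(q, q + k), score(k, k + q))    # black-white: 8.0, white-black: 12.0
print(score(q, q + q), score(k, k + k))    # black-black: 10.0, white-white: 18.0

# Q*(Q-K).t(): black-black and white-white both collapse to 0
print(score(q, q - q), score(k, k - k))    # 0.0, 0.0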

How do I check whether this formula gives better results? What would be the easiest way to do this?
Do I modify the self-attention in the SEQUENCE-TO-SEQUENCE MODELING WITH NN.TRANSFORMER AND TORCHTEXT tutorial, or is there an easier way to check the performance of this formula?
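One relatively easy option I can think of (a minimal sketch, not the tutorial's actual code; the class and argument names below are my own) is a small single-head self-attention module where the score function can be switched, so the same model can be trained once with each variant and the validation loss compared:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySelfAttention(nn.Module):
    # single-head self-attention with a switchable score function
    def __init__(self, d_model, score="qk"):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.score = score                     # "qk" -> Q @ K.t(), "q_plus_k" -> Q @ (Q + K).t()
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        if self.score == "qk":
            attn_scores = q @ k.transpose(-2, -1)          # standard dot-product attention
        else:
            attn_scores = q @ (q + k).transpose(-2, -1)    # proposed Q * (Q + K).t() score
        attn = F.softmax(attn_scores * self.scale, dim=-1)
        return attn @ v

Training the same model twice (same data, same hyperparameters, same seed), once with score="qk" and once with score="q_plus_k", and comparing validation loss/perplexity would be a first sanity check before trying to patch nn.MultiheadAttention itself.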

You could check with seq2seq as you mentioned. Do you have any update on the results?

Maybe using the same seeds?
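For example (a minimal sketch; exactly which seeds need fixing depends on the data pipeline), set the seeds once before building the model and the iterators so both runs start from the same initialization:

import random
import torch

def set_seed(seed=0):
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(0)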