Why is

Q * K.t()

(where .t() means transpose) used in attention, and not, say,

Q * (Q + K).t()?
For example, suppose we have two pixels, black and white, and we want to represent each ordered combination of them differently. Map each pixel to its own vector:

black -> Q
white -> K

so the four combinations are:

black white -> (Q, K)
white black -> (K, Q)
black black -> (Q, Q)
white white -> (K, K)
Then

Q * K.t()

gives the same score for

black white

and

white black

whereas with

Q * (Q + K).t()

all four would be different. Other options could be:
Q * (Q - K).t()

but then

black black
white white

would both score zero, i.e. be the same. Or

Q * K * K

but that would be computationally more expensive than

Q * (Q + K).t()

Or just

(Q + K)

but then

black white
white black

would be the same. Or

(Q - K)

but then

white white
black black

would be the same. Or only

Q

or only

K

but then combinations sharing the same pixel in that position could not be distinguished. Or we could concatenate Q and K, but the increased size would make the subsequent operations more expensive.
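The toy comparison above can be checked numerically. A minimal sketch, assuming scalar stand-ins for the two pixel embeddings (the values 2.0 and 3.0 are my own arbitrary choice, purely for illustration; in real attention Q and K are matrices produced by learned projections):

```python
# Toy scalar "embeddings": black -> Q, white -> K (arbitrary assumed values)
black, white = 2.0, 3.0

def qk(q, k):          # Q * K.t()
    return q * k

def q_q_plus_k(q, k):  # Q * (Q + K).t()
    return q * (q + k)

def q_q_minus_k(q, k): # Q * (Q - K).t()
    return q * (q - k)

def q_plus_k(q, k):    # (Q + K)
    return q + k

def q_minus_k(q, k):   # (Q - K)
    return q - k

# The four ordered combinations: (first pixel, second pixel)
pairs = {
    "black white": (black, white),
    "white black": (white, black),
    "black black": (black, black),
    "white white": (white, white),
}

for name, fn in [("Q*K.t()", qk), ("Q*(Q+K).t()", q_q_plus_k),
                 ("Q*(Q-K).t()", q_q_minus_k), ("(Q+K)", q_plus_k),
                 ("(Q-K)", q_minus_k)]:
    print(name, {p: fn(q, k) for p, (q, k) in pairs.items()})

# Q*K.t() cannot tell "black white" from "white black"
assert qk(black, white) == qk(white, black)
# Q*(Q+K).t() separates all four combinations (for these values)
assert len({q_q_plus_k(q, k) for q, k in pairs.values()}) == 4
# Q*(Q-K).t() collapses "black black" and "white white" (both zero)
assert q_q_minus_k(black, black) == q_q_minus_k(white, white) == 0.0
```

Note this sketch follows the simplification in the question, where each pixel has a single vector; in actual attention each token gets a separate query and key from different learned projections of the same embedding, which already breaks the symmetry argued about above.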