I would like to compute cross-attention between the bottleneck layer of an autoencoder (size 2x128x6x6) and the output of the decoder layer (size 2x25x48x48). I'd like to keep the positional information, and finally I would like to weight the bottleneck layer output with the cross-attention values. I don't have a clear idea of how this should be done. I would appreciate it if someone could advise me whether this is possible and point me to example code or a paper. Thanks.
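
To make the question concrete, here is a rough sketch of one way this could be set up. It assumes PyTorch and is not a definitive implementation. The assumptions: both feature maps are flattened into token sequences (36 bottleneck tokens of width 128 as queries, 2304 decoder tokens of width 25 as keys and values), learned positional embeddings are added to the queries and keys to retain spatial position, and the attended output is squashed with a sigmoid to act as a gate on the bottleneck. The class name `BottleneckCrossAttention`, the choice of 4 heads, and the sigmoid gating are hypothetical choices for illustration only.

```python
import torch
import torch.nn as nn


class BottleneckCrossAttention(nn.Module):
    """Hypothetical module: bottleneck tokens attend to decoder-output tokens."""

    def __init__(self, q_channels=128, kv_channels=25,
                 q_hw=(6, 6), kv_hw=(48, 48), num_heads=4):
        super().__init__()
        n_q = q_hw[0] * q_hw[1]      # 36 query positions
        n_kv = kv_hw[0] * kv_hw[1]   # 2304 key/value positions
        # Learned positional embeddings (an assumption; sinusoidal would also work).
        self.q_pos = nn.Parameter(torch.zeros(1, n_q, q_channels))
        self.kv_pos = nn.Parameter(torch.zeros(1, n_kv, kv_channels))
        nn.init.normal_(self.q_pos, std=0.02)
        nn.init.normal_(self.kv_pos, std=0.02)
        # kdim/vdim let keys and values have a different width than the queries.
        self.attn = nn.MultiheadAttention(
            embed_dim=q_channels, num_heads=num_heads,
            kdim=kv_channels, vdim=kv_channels, batch_first=True,
        )

    def forward(self, bottleneck, decoded):
        b, c, h, w = bottleneck.shape
        # (B, C, H, W) -> (B, H*W, C): one token per spatial location.
        q = bottleneck.flatten(2).transpose(1, 2)
        kv = decoded.flatten(2).transpose(1, 2)
        # Positions are added to queries and keys; values stay as raw features.
        attended, attn_weights = self.attn(q + self.q_pos, kv + self.kv_pos, kv)
        # Map the attended tokens back to (B, C, H, W) and use them as a gate in (0, 1).
        gate = torch.sigmoid(attended).transpose(1, 2).reshape(b, c, h, w)
        return bottleneck * gate, attn_weights


torch.manual_seed(0)
module = BottleneckCrossAttention()
bottleneck = torch.randn(2, 128, 6, 6)   # autoencoder bottleneck
decoded = torch.randn(2, 25, 48, 48)     # decoder output
weighted, attn_weights = module(bottleneck, decoded)
print(weighted.shape)       # same shape as the bottleneck
print(attn_weights.shape)   # (batch, query positions, key positions)
```

In this sketch the returned `weighted` tensor has the same shape as the bottleneck, and `attn_weights` holds, for each of the 36 bottleneck positions, a distribution over the 2304 decoder positions (averaged over heads). Whether gating with a sigmoid, adding the attended features residually, or using the raw attention map is the right way to weight the bottleneck depends on the task.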