Computing cross-attention between the bottleneck and the decoder output of an autoencoder

I would like to compute cross-attention between the bottleneck layer of an autoencoder (output size 2x128x6x6) and the output of the decoder (2x25x48x48). I'd like to keep the positional information, and finally I would like to weight the bottleneck output with the cross-attention values. I don't have a clear idea of how this should be done. I would appreciate it if someone could tell me whether this is possible and point me to example code or a paper. Thanks.
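To make the question more concrete, here is a rough sketch of one way this might be set up. Everything in it is an assumption rather than a known-good solution: it assumes PyTorch, that the first dimension (2) is the batch, that the bottleneck positions act as queries while the decoder positions act as keys and values, that learned positional embeddings are an acceptable way to keep positional information, and that "weighting" means a sigmoid gate multiplied into the bottleneck. The class name `BottleneckCrossAttention` and its parameters are hypothetical.

```python
import torch
import torch.nn as nn


class BottleneckCrossAttention(nn.Module):
    """Hypothetical sketch: bottleneck tokens attend over decoder-output tokens."""

    def __init__(self, bott_ch=128, bott_hw=6, dec_ch=25, dec_hw=48, num_heads=4):
        super().__init__()
        # Learned positional embeddings, one vector per spatial location
        # (assumption: a fixed sinusoidal encoding could be used instead).
        self.pos_q = nn.Parameter(torch.zeros(1, bott_hw * bott_hw, bott_ch))
        self.pos_kv = nn.Parameter(torch.zeros(1, dec_hw * dec_hw, dec_ch))
        # kdim/vdim let keys and values have a different channel count (25)
        # than the queries (128); they are projected internally.
        self.attn = nn.MultiheadAttention(
            embed_dim=bott_ch,
            num_heads=num_heads,
            kdim=dec_ch,
            vdim=dec_ch,
            batch_first=True,
        )

    def forward(self, bottleneck, decoder_out):
        b, c, h, w = bottleneck.shape
        # Flatten spatial dims into a token axis: (B, C, H, W) -> (B, H*W, C).
        q = bottleneck.flatten(2).transpose(1, 2) + self.pos_q
        kv = decoder_out.flatten(2).transpose(1, 2) + self.pos_kv
        # attended: (B, 36, 128); weights: (B, 36, 2304), averaged over heads.
        attended, weights = self.attn(q, kv, kv)
        # Turn the attended features into a gate in (0, 1) and restore the
        # bottleneck layout: (B, H*W, C) -> (B, C, H, W).
        gate = torch.sigmoid(attended).transpose(1, 2).reshape(b, c, h, w)
        # Weight the bottleneck with the attention-derived gate.
        return bottleneck * gate, weights


# Dummy tensors with the sizes from the question.
bottleneck = torch.randn(2, 128, 6, 6)
decoder_out = torch.randn(2, 25, 48, 48)

module = BottleneckCrossAttention()
weighted, attn_weights = module(bottleneck, decoder_out)
print(weighted.shape, attn_weights.shape)
```

The gated output keeps the bottleneck shape, so it could be fed back into the rest of the network unchanged. Each of the 36 bottleneck positions attends over all 2304 decoder positions, so the attention map is small enough to hold in memory at these sizes; whether a sigmoid gate is the right kind of weighting is exactly the part I am unsure about.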