Attention without RNN

I am working on a multi-modal classification task and currently have two vectors: one derived from text and the other from images. I want to use attention to combine them for the final output. However, every example I can find applies attention on top of an RNN/LSTM/GRU. I don't think I can use a recurrent model here, because the two vectors are not related as steps of a sequence. What should I do? Thanks.
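
To make the question concrete, here is a rough sketch of the kind of thing I am imagining: additive attention that scores each of the two vectors independently and takes a softmax-weighted sum, with no recurrence involved. This is only an illustration under my own assumptions, not something taken from a reference implementation. It assumes both vectors have already been projected to the same dimension `d`, and the names `attention_fuse`, `w`, `b`, and `u` are hypothetical placeholders for parameters that would be learned in a real model (here they are just random).

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def attention_fuse(text_vec, image_vec, w, b, u):
    """Fuse two same-sized modality vectors with additive attention.

    w, b, u are hypothetical parameters (random here, learned in practice).
    """
    feats = np.stack([text_vec, image_vec])   # shape (2, d)
    hidden = np.tanh(feats @ w + b)           # shape (2, h)
    scores = hidden @ u                       # shape (2,), one score per modality
    weights = softmax(scores)                 # attention weights, sum to 1
    fused = weights @ feats                   # shape (d,), weighted sum of the two
    return fused, weights

# Toy inputs; assumption: both vectors already live in the same d-dim space.
rng = np.random.default_rng(0)
d, h = 8, 4
text_vec = rng.normal(size=d)
image_vec = rng.normal(size=d)
w = rng.normal(size=(d, h)) * 0.1
b = np.zeros(h)
u = rng.normal(size=h)

fused, weights = attention_fuse(text_vec, image_vec, w, b, u)
print("attention weights:", weights)
print("fused vector shape:", fused.shape)
```

The fused vector would then go into the final classifier layer. Would something along these lines be a reasonable way to apply attention to two unrelated vectors, or is there a more standard approach?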