Is there any tutorial on attention, especially how to use cross_attention?

Is there any tutorial on attention, in particular how to use cross_attention?

My task is multimodal emotion recognition, and I want to use cross-attention to capture the dynamic, relevant information shared between the audio and text modalities.
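
To make the question concrete, here is a minimal sketch of what I understand cross-attention to be, written in plain Python: text vectors act as queries, and audio vectors act as keys and values. The function name `cross_attention`, the toy feature values, and the omission of learned projection matrices are my own assumptions for illustration and are not taken from any library, so please correct me if this is not the right idea.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    queries: list of text vectors, each of length d
    keys:    list of audio vectors, each of length d
    values:  list of audio vectors, one per key
    Returns (outputs, weights): one fused vector and one row of
    attention weights per query.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this text token to every audio frame.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys
        ]
        w = softmax(scores)
        # Weighted sum of the audio values, one output channel at a time.
        out = [
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy features: 2 text tokens, 3 audio frames, dimension 4.
text = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
audio = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

# Text attends to audio: each text token gets an audio-informed vector.
fused, attn = cross_attention(text, audio, audio)
```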

Could you please give me some tutorials or suggestions?
Thanks.
Best wishes