Is there a tutorial on attention, in particular on how to use cross-attention?
My task is multimodal emotion recognition, and I want to use cross-attention to capture the dynamically relevant information shared between the audio and text modalities.
Could you please point me to some tutorials or give me some suggestions?
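To make the question concrete, here is a rough sketch of what I have in mind. It assumes PyTorch and its built-in `nn.MultiheadAttention`; the class name `CrossModalAttention`, the feature size of 128, and the choice of text as the query with audio as key and value are all my own placeholders, not something taken from a reference implementation.

```python
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """Hypothetical block: text tokens attend over audio frames."""

    def __init__(self, dim=128, num_heads=4):
        super().__init__()
        # batch_first=True means inputs are (batch, sequence, feature).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text, audio, audio_padding_mask=None):
        # Queries come from text; keys and values come from audio.
        # audio_padding_mask is (batch, audio_len), True where a frame is padding.
        attended, weights = self.attn(
            query=text,
            key=audio,
            value=audio,
            key_padding_mask=audio_padding_mask,
        )
        # Residual connection plus layer norm, as in a standard Transformer block.
        return self.norm(text + attended), weights


torch.manual_seed(0)
text = torch.randn(2, 10, 128)   # assumed: 2 utterances, 10 text tokens each
audio = torch.randn(2, 50, 128)  # assumed: 50 audio frames, already projected to 128 dims
layer = CrossModalAttention()
fused, weights = layer(text, audio)
print(fused.shape, weights.shape)
```

Here `fused` keeps the text sequence length, and `weights` has one row per text token with a distribution over audio frames. I am not sure whether this direction (text queries audio) is the right one for emotion recognition, or whether I should also add the reverse direction and concatenate both.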
Thanks
Best wishes