Applying attention to images

I’m trying to apply attention to images.
Are there any good references on that?

I’m confused about the attention mechanism: why it works, and how attention on text differs from attention on images.

This will give you an overview of attention in computer vision in general.
Attention in Computer vision.
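
On the text vs. image part of the question, here is a rough sketch of one common approach (patch-based self-attention), not necessarily what the linked article covers. It assumes the image is cut into non-overlapping patches and that each flattened patch plays the role a word embedding plays in text. The helper names `image_to_patches` and `self_attention`, the 4x4 patch size, and the random projection weights are all made up for illustration. Position embeddings and training are left out.

```python
import numpy as np

def image_to_patches(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)  # one row ("token") per patch

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # similarity of every patch to every patch
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8, 3))     # toy 8x8 RGB image (random values)
tokens = image_to_patches(image, patch=4)  # 4 patches, each 4*4*3 = 48 values

d_model = 16  # hypothetical projection size
w_q, w_k, w_v = (rng.standard_normal((tokens.shape[1], d_model)) * 0.1 for _ in range(3))

out, weights = self_attention(tokens, w_q, w_k, w_v)
print(tokens.shape, out.shape, weights.shape)
```

The `self_attention` function would work unchanged on a matrix of word embeddings. In this sketch the only image-specific step is how the tokens are built, which is where the two settings differ in this formulation.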

