Using Transformer networks for images

Does anyone know of a useful tutorial on Transformers for vision?

Thank you. I mean applying the “Attention Is All You Need” paper to vision, for example to image sequences or image captioning.