Why use Transformers in CV projects?

I am currently studying NLP and have just learned about the Transformer. My understanding after a brief study is that Transformers are used in NLP because the input text is large and open-ended: its length varies, and the set of possible words or characters cannot be fixed in advance, so a Transformer is used to extract features from the input text first. Input in the CV field, by contrast, is extremely constrained. Whatever the image shows, every pixel of an 8-bit image is an integer in the range 0-255, with three channels for RGB or one channel for grayscale. Given that, why is it said that Transformers can improve the results of CV projects?
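To make the contrast I have in mind concrete, here is a toy sketch in plain Python. Everything in it is made up for illustration: the two sentences, the whitespace tokenizer (`build_vocab` and `encode` are hypothetical helpers, not from any library), and the 2x2 RGB image. It only shows the premise of my question, namely that text yields variable-length ID sequences over a vocabulary that grows with the data, while pixel values always stay within a fixed range.

```python
# Toy illustration only: sentences, tokenizer, and image are invented.

def build_vocab(sentences):
    """Assign an integer ID to each distinct whitespace-separated token."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            # New tokens get the next free ID, so the vocabulary grows with the data.
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(sentence, vocab):
    """Map a sentence to its list of token IDs."""
    return [vocab[token] for token in sentence.lower().split()]


sentences = [
    "Transformers process text",
    "Why do Transformers also help in computer vision",
]
vocab = build_vocab(sentences)
text_ids = [encode(s, vocab) for s in sentences]

# A 2x2 RGB image as nested lists; each channel value is an 8-bit integer.
image = [
    [(0, 128, 255), (12, 34, 56)],
    [(255, 255, 255), (7, 7, 7)],
]
channel_values = [v for row in image for pixel in row for v in pixel]

print("sequence lengths:", [len(ids) for ids in text_ids])
print("vocabulary size:", len(vocab))
print("pixel value range:", min(channel_values), "to", max(channel_values))
```

The sequence lengths differ per sentence and the vocabulary size depends on the corpus, whereas the pixel values can never leave 0-255 no matter which image is used.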