I have mostly seen the vision transformer architecture used for classification tasks. Why is it not used for regression applications where the output is also an image, such as deblurring or image translation?
If a vision transformer has been used in any such application, could you please share a link?
Also, I would like to confirm: as with an FCN, is it possible to use different image sizes for training and inference in a ViT?
I think there have been pretraining methods that output completions for masked parts of the input. Is that like what you have in mind? (I cannot find the reference I was looking for, but MST: Masked Self-Supervised Transformer for Visual Representation | OpenReview seems somewhat similar.)
My impression is that it’s relatively resource-intensive, but can be done.
I don’t think you can easily use varying image sizes with ViT, as the patches are typically of a fixed size and fixed number.
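To illustrate why (the numbers below are my own example, not from the thread): a ViT splits the input into non-overlapping patches and learns one position embedding per patch, so changing the input resolution changes the token count and the learned position-embedding table no longer lines up.

```python
def num_patches(h, w, patch=16):
    # ViT tokenizes the image into non-overlapping patch x patch blocks,
    # so the number of tokens is fixed by the input resolution.
    assert h % patch == 0 and w % patch == 0
    return (h // patch) * (w // patch)

# Training at 224x224 gives 196 tokens; inference at 256x256 gives 256,
# which no longer matches a position-embedding table learned for 196.
print(num_patches(224, 224))  # 196
print(num_patches(256, 256))  # 256
```

(Workarounds such as interpolating the position embeddings exist, but that is extra machinery rather than something ViT supports out of the box.)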
In my application, I need the output to be an image, so I am a bit confused about the decoder part that generates the image from the output of the vision transformer encoder. In the referenced paper, the authors use a CNN architecture for the decoder. My question is: is a CNN the only option for the decoder, or could some other architecture also reconstruct the desired image from the vision transformer encoder output? Any reference would be helpful.
An obvious one might be imagegpt: GitHub - openai/image-gpt , but in general, if you have 16x16 pixel blocks and want c output channels, you could just make the last layer output c * 16 * 16 values per block and build your image from that by reshaping.
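A minimal sketch of that reshaping idea (the function name and shapes are my own, and numpy stands in for the output of the projection layer): each encoder token is projected to c * 16 * 16 values, and the patch grid is then rearranged back into an image.

```python
import numpy as np

def tokens_to_image(tokens, grid_h, grid_w, c=3, patch=16):
    """Rearrange per-patch predictions of shape
    (grid_h * grid_w, c * patch * patch) into a (c, H, W) image."""
    assert tokens.shape == (grid_h * grid_w, c * patch * patch)
    x = tokens.reshape(grid_h, grid_w, c, patch, patch)
    # Move channels first, then interleave grid and patch axes
    # so adjacent patches become adjacent pixel blocks.
    x = x.transpose(2, 0, 3, 1, 4)  # (c, grid_h, patch, grid_w, patch)
    return x.reshape(c, grid_h * patch, grid_w * patch)

# A 14x14 grid of 16x16 patches stitches into a 224x224 RGB image.
out = tokens_to_image(np.zeros((14 * 14, 3 * 16 * 16)), 14, 14)
print(out.shape)  # (3, 224, 224)
```

In a real model the `tokens` array would be the output of a final linear layer of width c * 16 * 16 applied to each encoder token; the transpose-then-reshape is just the inverse of the patchification done on the input.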