The impact of padded images on Vision Transformers and EfficientNets

I am working with images of different sizes. As far as I understand, Vision Transformers and EfficientNets work mainly with specific input sizes (224, 384, …).
I was wondering: what would be the effect of padded images on Vision Transformers (all the variants in this research direction, including those based purely on attention and those that combine convolutions with attention) and on EfficientNets?
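
For context, here is a minimal sketch of the kind of padding I mean. PyTorch and the helper name `pad_to_square` are just my assumptions for illustration; it zero-pads a smaller image up to a fixed model input size:

```python
import torch
import torch.nn.functional as F

def pad_to_square(img: torch.Tensor, target: int = 224) -> torch.Tensor:
    """Zero-pad a CHW image tensor to target x target, centering the content.

    Assumes both spatial dims already fit within `target`; a real pipeline
    would first resize the longer side down to `target`.
    """
    _, h, w = img.shape
    pad_h = target - h
    pad_w = target - w
    # F.pad expects (left, right, top, bottom) for the last two dimensions
    return F.pad(
        img,
        (pad_w // 2, pad_w - pad_w // 2, pad_h // 2, pad_h - pad_h // 2),
        mode="constant",
        value=0.0,
    )

# Example: a 3x160x224 image padded up to 3x224x224
x = torch.rand(3, 160, 224)
print(pad_to_square(x).shape)  # torch.Size([3, 224, 224])
```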