The torchvision.models documentation mentions that all models require an image input size of "at least 224x224", with some exceptions, like the Inception model, which requires exactly 299x299. Now I am wondering: for all the other models, are arbitrary other image dimensions really fine as well?
I understand that technically these "modern" models can all handle other dimensions. But are there any constraints on the image dimensions, or better: are there dimensions that work better than others? Can I use any "random" size, for example 474x474 or 834x834? Or should I follow some best practices when choosing image dimensions?
If I am not mistaken, these models use an adaptive average pooling layer after the convolutions, which collapses the final feature map to a fixed spatial size. That is why image data of different sizes can go through the model without raising errors. That is not to say the model will comprehend all image sizes equally well; it depends on the structure of the model.
For instance, a Conv2d layer with a given kernel size may be able to capture a particular feature that is important for classification at one image size, but miss that same feature when it appears much larger, because the kernel then covers a smaller fraction of it. (Read up on convolution kernels and receptive fields to understand why.)
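This scale effect can be demonstrated with a toy example (a hand-picked Sobel-like 3x3 edge kernel, not anything a model actually learns): the same edge produces a weaker peak response once the image is upscaled, because the transition no longer fits inside the 3x3 window.

```python
import torch
import torch.nn.functional as F

# Sobel-like vertical edge detector: responds to horizontal intensity change
# within its 3x3 window.
kernel = torch.tensor([[[[-1., 0., 1.],
                         [-2., 0., 2.],
                         [-1., 0., 1.]]]])

# Sharp 1-pixel edge: columns jump from 0 to 1 between adjacent pixels.
img = torch.zeros(1, 1, 8, 8)
img[..., 4:] = 1.0
resp_small = F.conv2d(img, kernel).abs().max()

# Same edge after 4x bilinear upscaling: the transition is now spread over
# several pixels, so the 3x3 kernel sees a shallower gradient.
big = F.interpolate(img, scale_factor=4, mode="bilinear", align_corners=False)
resp_big = F.conv2d(big, kernel).abs().max()

print(resp_small.item(), resp_big.item())  # upscaled edge: smaller peak response
```

A trained network compensates for this through its stack of layers and its training resolution, which is exactly why feeding it sizes far from what it was trained on can hurt accuracy even though no error is raised.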
Due to the way neural networks learn, there is no one-size-fits-all answer here, and it may also be dataset-dependent. But generally speaking, if you have enough compute resources, try resizing your images to several candidate sizes beforehand and see which works best for your application. If you don't have those resources, just preprocess your images to the smallest allowed size, as that also means fewer computations and less time during training and inference.