import torch
from torchvision import models
model = models.vit_b_32(pretrained=True, image_size=320)
model.eval()
The above code fails at line 3 with the following error:
ValueError: The parameter 'image_size' expected value 224 but got 320 instead.
So does PyTorch's pre-trained Vision Transformer only accept a fixed input image size, unlike pre-trained ResNets, which are flexible with the image size?
I am hesitant to downsize my images because I am performing crack detection on metal surfaces. After downsizing to 224, the crack pixels become far too small, which I believe may hurt my model's performance. When I train my model on ResNets, I get optimal performance for image sizes > 400 px.
If pretrained, yes: 224 is the de facto size. If you do not need pretrained weights, you can specify the image_size and patch_size arguments yourself.
Now, you might still be able to get away with pretrained weights if you swap out some layers and then redefine the image_size after the fact. Note that you'll want to keep the patch_size unchanged.