The pretrained models most likely stick to the literature for the corresponding architecture, which often used input images of shape 224 x 224 (often randomly cropped to this shape).
Since these torchvision models use adaptive pooling layers, the strict size restriction was relaxed, so you can pass larger images and (some) smaller ones. Note that the minimum size depends on the conv and pooling operations, which would create an empty output if the input is too small.
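E.g. this small sketch shows both behaviors, using vgg16 just as an example (the minimum size differs between architectures):

```python
import torch
from torchvision import models

model = models.vgg16(weights=None)  # random init is enough to check shapes
model.eval()

# Larger inputs work: the AdaptiveAvgPool2d((7, 7)) before the classifier
# maps any sufficiently large activation to the expected 7x7 shape.
out = model(torch.randn(1, 3, 512, 512))
print(out.shape)  # torch.Size([1, 1000])

# Too-small inputs fail: the five stride-2 max pools reduce a 16x16 input
# to an empty activation before the adaptive pooling is reached.
try:
    model(torch.randn(1, 3, 16, 16))
except RuntimeError as e:
    print(e)  # "... Output size is too small"
```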
That being said, I would not expect good performance for input images with largely different shapes and would try to fine-tune the model for this use case.
I am able to pass much larger images and, as you said, the adaptive pooling layer before the fully connected layer allows for this.
What do you mean by fine-tuning the model? Fine-tuning in the transfer-learning sense, or something more structural? For example, I was thinking of adding a convolutional layer at the beginning of the pretrained network, followed by an AdaptiveAvgPool2d layer to bring the output size down to 224x224, and then passing that into the next conv layer, which is in fact the first layer of the pretrained network. Does this make sense?
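Roughly something like this sketch (using resnet18 only as a placeholder backbone; the extra conv's channel and kernel sizes are just assumptions on my side):

```python
import torch
import torch.nn as nn
from torchvision import models

class AdaptedModel(nn.Module):
    def __init__(self):
        super().__init__()
        # New trainable conv at the start, keeping 3 channels so the
        # pretrained first conv still receives a 3-channel input
        self.pre_conv = nn.Conv2d(3, 3, kernel_size=3, padding=1)
        # Bring the spatial size down to the 224x224 the backbone expects
        self.pre_pool = nn.AdaptiveAvgPool2d((224, 224))
        self.backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

    def forward(self, x):
        x = self.pre_pool(self.pre_conv(x))
        return self.backbone(x)

model = AdaptedModel()
out = model(torch.randn(1, 3, 1024, 1024))  # much larger input
print(out.shape)  # torch.Size([1, 1000])
```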