How much does the input size of a pretrained model affect a classification model?

When I use a pre-trained model (such as MobileNetV2) to train my classification model, I first set the input size of my classification model to $64 \times 128$, and the results were not good. I then changed the input size to $128 \times 256$, retrained the model, and got a better result.

Does this mean that when I use a model pre-trained on ImageNet, the input size needs to be as close as possible to the original input size ($224 \times 224$) of the pre-trained model? And why does it matter for a classification model?

Generally, the spatial resolution should match the original input resolution, as the conv kernels have been trained to extract useful features with this particular setup.
For example, edge detector filters that were trained at a particular resolution might no longer detect edges if you zoom in or out of the image.
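
To make the edge-detector point concrete, here is a toy 1D sketch. It is not a learned MobileNetV2 filter: the 3-tap kernel, the `make_edge` helper, and the choice to model "zooming in" as stretching a sharp edge into a ramp are all assumptions made for illustration.

```python
def correlate_valid(signal, kernel):
    """Slide `kernel` over `signal` without padding and return each response."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def make_edge(steps, flat=4):
    """Build a 0 -> 1 edge whose transition takes `steps` pixel steps.

    steps=1 is a sharp edge; larger values mimic the same edge after the
    image has been upscaled with interpolation (hypothetical model).
    """
    ramp = [i / steps for i in range(1, steps)]
    return [0.0] * flat + ramp + [1.0] * flat


# Hand-written central-difference kernel standing in for a learned edge filter.
EDGE_KERNEL = [-1.0, 0.0, 1.0]


def peak_response(steps):
    """Strongest filter response anywhere along the edge signal."""
    return max(correlate_valid(make_edge(steps), EDGE_KERNEL))


sharp = peak_response(1)   # edge at the scale the filter "expects"
zoomed = peak_response(4)  # same edge stretched over 4 pixel steps
print(f"sharp edge peak:  {sharp}")
print(f"zoomed edge peak: {zoomed}")
```

The fixed 3-pixel filter responds half as strongly to the stretched edge, even though it is the same edge. A real network has many filters at several depths, so the effect is less clear-cut, but the mismatch works in the same direction.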

I guess the spatial resolution does not necessarily have to be 224x224, as long as the “pixel spacing” stays the same. I.e., if edges and other features still span the same number of pixels, the filters might still work fine. You could run some experiments to verify this.
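
As a rough way to reason about the “pixel spacing” before running experiments, the sketch below computes how many pixels a feature spans after resizing. The source image size (256x512) and the 8-pixel feature width are made-up numbers, not values from the question; only the two candidate input sizes come from it.

```python
def resize_scale(source_hw, target_hw):
    """Return the (vertical, horizontal) scale factors of a resize."""
    src_h, src_w = source_hw
    dst_h, dst_w = target_hw
    return dst_h / src_h, dst_w / src_w


def feature_extent_after_resize(feature_px, source_hw, target_hw):
    """Pixels spanned by a feature (height, width) after the resize."""
    scale_y, scale_x = resize_scale(source_hw, target_hw)
    return feature_px * scale_y, feature_px * scale_x


SOURCE_HW = (256, 512)  # hypothetical size of the raw images
FEATURE_PX = 8          # hypothetical extent of a feature in the raw image

# The first two sizes are the ones tried in the question; the third keeps
# the raw resolution, so the pixel spacing is unchanged.
CANDIDATE_SIZES = [(64, 128), (128, 256), (256, 512)]

extents = {
    size: feature_extent_after_resize(FEATURE_PX, SOURCE_HW, size)
    for size in CANDIDATE_SIZES
}
for size, (h_px, w_px) in extents.items():
    print(f"input {size[0]}x{size[1]}: feature spans {h_px} x {w_px} px")
```

Under these assumptions, the feature shrinks to 2 pixels at 64x128 and to 4 pixels at 128x256. This would be consistent with the larger input working better, but it is only a back-of-the-envelope check and not a substitute for retraining at each size.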