Faster R-CNN: image size in training and inference

For Faster/Mask R-CNN, the min and max size of the input images are fixed before training, and the choice is determined to a great extent by the available CUDA memory. This is fine.
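
For context, this is how those limits get fixed, assuming the torchvision detection API (the min_size=800 / max_size=1333 defaults I mention below are torchvision's): the values are set at construction time and applied by the model's internal GeneralizedRCNNTransform.

```python
import torchvision

# Sketch, assuming the torchvision detection API: the resize limits are
# baked in when the model is built and applied by model.transform.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights="DEFAULT",  # COCO-pretrained weights (torchvision >= 0.13)
    min_size=800,       # shorter image side is resized to this
    max_size=1333,      # longer side is capped at this
)
```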

What I found at inference time is that the accuracy of the model strongly depends on the size of the input image. By default, min_size=800 and max_size=1333, but as I varied both hyperparameters I got either better or worse results. I suspect the best results are obtained when min_size and max_size are close to the input image's size, but I'm not sure.
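
Roughly, the kind of sweep I mean looks like this. This is a sketch that assumes torchvision's implementation, where resizing happens in model.transform (a GeneralizedRCNNTransform whose min_size attribute is a tuple); `evaluate_map` and `val_loader` are hypothetical stand-ins for an mAP evaluation routine and a validation data loader:

```python
import torch

# Sketch of the sweep, assuming torchvision's GeneralizedRCNNTransform
# (min_size must be a tuple). `evaluate_map` and `val_loader` are
# hypothetical placeholders.
model.eval()
results = {}
for min_size, max_size in [(600, 1000), (800, 1333), (1024, 1707)]:
    model.transform.min_size = (min_size,)
    model.transform.max_size = max_size
    with torch.no_grad():
        results[(min_size, max_size)] = evaluate_map(model, val_loader)
print(max(results.items(), key=lambda kv: kv[1]))  # sizes with the best mAP
```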

So this is my question: is there a reliable way to find the optimal input image size?

I would assume the model works best during validation on images whose shapes and other properties are as close as possible to the training data.
The shape limits are most likely imposed by the architecture (though I haven't looked into the source code to verify it).
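
If that assumption holds, one way to apply it is to derive the inference limits from the training images themselves. A minimal sketch, where `train_image_sizes` is a hypothetical list of (height, width) pairs for the training set and `model` is the torchvision detector from the question:

```python
from statistics import median

# Sketch of the "match the training data" heuristic: pick the inference
# resize limits from the median training image shape.
# `train_image_sizes` is a hypothetical list of (height, width) pairs.
short_sides = [min(h, w) for h, w in train_image_sizes]
long_sides = [max(h, w) for h, w in train_image_sizes]
model.transform.min_size = (int(median(short_sides)),)
model.transform.max_size = int(median(long_sides))
```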