maskrcnn_resnet50_fpn() `min_size` and `max_size` arguments: what do they do?

Hi all,

High-level explanation:
I am confused about what the `min_size` and `max_size` arguments actually do, and how they should be chosen for both training and inference. I've noticed that after training my Mask R-CNN model, I can basically tinker with those two values and get dramatically different results on my test data.

More details:
I trained a Mask R-CNN for ship detection. During training I instantiated my model like so (the call got cut off when I pasted it; the `...` stand in for the values I passed):

    import torchvision

    def get_model_instance_segmentation(num_classes):
        # load an instance segmentation model pre-trained on COCO
        model = torchvision.models.detection.maskrcnn_resnet50_fpn(
            pretrained=True,
            min_size=..., max_size=...)  # values chosen to match my chip sizes

I did this because my training chips varied in size from 256x256 to 512x512. By default, `min_size` and `max_size` are 800 and 1333 respectively. In the end I found these don't seem to need to match your chip size during training, but they can impact your results during inference regardless of chip/input image size.

The `min_size` and `max_size` args are passed as kwargs from `maskrcnn_resnet50_fpn()` down to the `MaskRCNN` class, as per the source code here: torchvision.models.detection.mask_rcnn — Torchvision 0.12 documentation

What it says in the source code comments:
min_size (int): minimum size of the image to be rescaled before feeding it to the backbone
max_size (int): maximum size of the image to be rescaled before feeding it to the backbone

Digging deeper into the source code, `min_size` and `max_size` feed into some sort of transform that rescales each image before it reaches the backbone. But there is no explanation of exactly what it does, or how it can impact your training and inference. Any help would be greatly appreciated.
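From what I can tell reading the source, the scaling rule looks roughly like this. This is my own sketch of my understanding, not code copied from torchvision, and the function name is mine:

```python
# Rough sketch of the scale factor I believe GeneralizedRCNNTransform applies
# to an image before it is fed to the backbone (structure and name are mine):
def resize_scale(height, width, min_size=800, max_size=1333):
    # Scale so the SHORTER side becomes min_size...
    scale = min_size / min(height, width)
    # ...unless that would push the LONGER side past max_size,
    # in which case cap the scale so the longer side equals max_size.
    if scale * max(height, width) > max_size:
        scale = max_size / max(height, width)
    return scale

# A 256x256 chip would get upscaled 3.125x (to 800x800) at the defaults:
print(resize_scale(256, 256))   # 3.125
# A wide 500x1500 image is capped by max_size instead:
print(resize_scale(500, 1500))  # ~0.889
```

If that's right, it would explain why my 256x256 chips still trained fine with the defaults (they just get upscaled), and why changing the values at inference shifts object scales and therefore detections.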

Thank you everyone - I really appreciate your time in reading my request.