Why do object detection models take their input as a list of tensors?

Hello, I was going through the TorchVision Object Detection Finetuning Tutorial — PyTorch Tutorials 1.11.0+cu102 documentation.
There, I see that the model (a Faster R-CNN model with a ResNet-50 backbone) takes a list of input images, and these images can have various dimensions. But pretrained classifier models like ResNet or MobileNet usually do not accept variable-size input. For example, MobileNetV2 takes an input image of shape (3, 224, 224), and instead of a list it takes a batch packed into a single tensor.

Does the model do this preprocessing internally? Is there any documentation for that?

Internally, GeneralizedRCNNTransform is applied inside GeneralizedRCNN and resizes the data using this code snippet.