How is object detection training done in batches?

I’ve been trying to train object detection models, and my question is:

How are these models trained in batches, given that the images in a dataset have different shapes? I'm trying to fine-tune DETR, and it was trained in batches of 2, so what am I missing? Is it common to train these models in batches of 1?
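To make the question concrete: one common way to batch images of different sizes is to pad every image to the largest height and width in the batch and to keep a boolean mask marking which pixels are padding. The sketch below illustrates that idea in PyTorch. The function name `pad_collate` and the example image sizes are hypothetical, and this is not necessarily how any particular DETR training script implements it.

```python
import torch

def pad_collate(images):
    """Pad a list of (C, H, W) tensors to a common size.

    Returns the padded batch and a boolean mask where True marks padding.
    Hypothetical helper for illustration only.
    """
    max_h = max(img.shape[1] for img in images)
    max_w = max(img.shape[2] for img in images)
    channels = images[0].shape[0]

    # Start from an all-zero batch and an all-padding mask.
    batch = torch.zeros(len(images), channels, max_h, max_w, dtype=images[0].dtype)
    mask = torch.ones(len(images), max_h, max_w, dtype=torch.bool)

    for i, img in enumerate(images):
        _, h, w = img.shape
        batch[i, :, :h, :w] = img   # copy the image into the top-left corner
        mask[i, :h, :w] = False     # mark real pixels as not padding
    return batch, mask

# Two images with different shapes (sizes chosen arbitrarily).
images = [torch.rand(3, 480, 640), torch.rand(3, 600, 400)]
batch, mask = pad_collate(images)
```

A function like this could be passed as `collate_fn` to a `torch.utils.data.DataLoader`, with the targets collected into a plain list because the number of boxes also varies per image.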