Bounding box format for object detection datasets

I have a small doubt about the bounding box format used for object detection with pretrained models like Faster R-CNN or YOLO.

Regardless of which bounding box format they follow (like VOC), I have a dataset where a single image has several bounding boxes. There are two ways to approach the problem:

  1. Map the image to each individual bounding box and treat each (image, box) pair as a data point.
     Example: the dataset format in Cell 3 of
     Wheat_head_classification | Kaggle
  2. Build a dict of all the bounding boxes for an image and feed it to the network.
     Example:
     AIcrowd | Tutorial with Pytorch, Torchvision and Pytorch Lightning ! | Posts
     But in this case, how is the network able to learn from a variable number of bounding boxes per image?

    I am missing some crucial understanding here that I can't get around. Could someone clear up my doubts? :grin:
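To show what I mean by option 2, here is a minimal sketch of how I understand it in the torchvision detection style, where each target is a dict of boxes and labels and a custom `collate_fn` keeps targets of different lengths from being stacked. The dataset and box values below are made up just for illustration:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyWheatDataset(Dataset):
    """Toy dataset: image i has i + 1 bounding boxes (made-up data)."""
    def __init__(self):
        self.images = [torch.rand(3, 224, 224) for _ in range(4)]
        # Boxes in (x1, y1, x2, y2) format; shapes differ per image.
        self.boxes = [torch.rand(i + 1, 4) for i in range(4)]

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        target = {
            "boxes": self.boxes[idx],
            "labels": torch.ones(len(self.boxes[idx]), dtype=torch.int64),
        }
        return self.images[idx], target

def collate_fn(batch):
    # Return tuples of images and targets instead of stacking,
    # because each target dict holds a different number of boxes.
    return tuple(zip(*batch))

loader = DataLoader(ToyWheatDataset(), batch_size=2, collate_fn=collate_fn)
images, targets = next(iter(loader))
print(len(images), [t["boxes"].shape[0] for t in targets])  # 2 [1, 2]
```

The detection model then receives the full list of boxes per image, so the variable length is handled inside the model (e.g. via anchor matching), not by the batching itself.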