I’m fine-tuning a Faster R-CNN model. In my labels, the number of bounding boxes per image varies between 0 and 3; most images have only one bounding box, i.e. one appearance of the object.

From what I’ve seen, during training the input to the model should be a tuple of images and targets, where the targets take the form of a dictionary with five keys (boxes, labels, image_id, area and iscrowd), and all values of the dictionary are tensors. As I understand it, the value of area, for example, is a tensor of shape batch size × maximal number of objects in an image.

Since the maximal number of objects in an image in my dataset is 3, does this mean that for images with fewer objects I need to feed the model fake objects? That is, if an image has only one labeled bounding box, I can’t input it to the network as is, so I need to fill in fake information for the other two. In that case, what area values should I use for the second and third bounding boxes of an image that in reality has only one? More generally, what should I do if each image has a different number of bounding boxes?
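To make the format concrete, here is a sketch of the target dict I build for a single image with one box (the coordinates and the label value are made up for illustration):

```python
import torch

# Sketch of a per-image target dict; box coordinates and the label
# value are illustrative placeholders, not real data.
boxes = torch.tensor([[10., 20., 50., 80.]])  # shape [N, 4], here N = 1

target = {
    "boxes": boxes,
    "labels": torch.tensor([1], dtype=torch.int64),   # one class id per box
    "image_id": torch.tensor([0]),
    # area of each box: (x2 - x1) * (y2 - y1)
    "area": (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]),
    "iscrowd": torch.zeros((1,), dtype=torch.int64),
}
```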
The same dilemma applies to the other values of the dictionary.
I should mention that when I tried to feed the model a different number of bounding boxes per image, I got the following error: “Expected target boxes to be a tensor of shape [N, 4]…”. The error went away once I started supplying exactly three bounding boxes per image.
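For concreteness, here is a sketch of the padded target I ended up with for an image that really has only one box. The zero rows, the zero areas, and the label value 0 for the fake entries are placeholders I made up; whether those are the right values for the fake objects is exactly what I’m unsure about:

```python
import torch

# Padded target for an image that really has only ONE box.
# Rows 2 and 3 are fake; all their zero values are placeholders.
real_box = torch.tensor([[10., 20., 50., 80.]])
fake_boxes = torch.zeros((2, 4))                   # two dummy boxes
boxes = torch.cat([real_box, fake_boxes], dim=0)   # shape [3, 4]

target = {
    "boxes": boxes,
    "labels": torch.tensor([1, 0, 0], dtype=torch.int64),  # 0 for fakes?
    "image_id": torch.tensor([0]),
    "area": torch.tensor([2400., 0., 0.]),  # what should the fake areas be?
    "iscrowd": torch.zeros((3,), dtype=torch.int64),
}
```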