How to deal with labels of different shapes in object detection? (YOLOv3)

I am currently implementing my own train.py for YOLOv3 object detection.

However, I ran into a problem with the labels.

For YOLOv3 object detection, multiple objects may be detected in a single image.
This means different images may contain different numbers of objects, and therefore different numbers of ground-truth bounding boxes.

Ideally, I would like to concatenate the ground-truth bounding boxes of the same image along one dimension.

Assume it is a single-class problem, e.g.
[image_index, object_index, parameter], where parameter = [tx, ty, tw, th, Po, Pc1]

For image 1, 5 objects are annotated in the ground truth, so its label shape is [1, 5, 6].
For image 2, only 2 objects are annotated, so its label shape is [1, 2, 6].

In this case, the labels for image 1 and image 2 cannot be concatenated into one tensor, as they have different sizes at dimension 1.
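
For illustration, a minimal sketch with dummy tensors of the shapes above:

import torch

# Dummy ground-truth labels: 5 boxes for image 1, 2 boxes for image 2,
# 6 parameters per box, as described above.
label1 = torch.zeros(1, 5, 6)
label2 = torch.zeros(1, 2, 6)

# Fails: torch.cat requires matching sizes in every dimension
# except the one being concatenated (dim 0 here).
batch = torch.cat([label1, label2], dim=0)  # raises RuntimeError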

To solve this problem, I tried to collect them in a Python list:

labels = []
labels.append(label1)
labels.append(label2)

However, the tensors must be stacked together by torch.utils.data.DataLoader's default collate function.
Hence, I got the following error owing to the differing label shapes:
RuntimeError: stack expects each tensor to be equal size, but got [1, 16] at entry 0 and [6, 16] at entry 1

(Here 16 = 1 + 4 + 11: one objectness score, four box parameters, and 11 classes.)

All my labels are read from a single .csv file. I am currently testing train.py with 7 images and 11 classes. All images are padded and resized to 416x416.

Thanks for your time.

You could return the targets as e.g. a dict as is also done in this tutorial.
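
A minimal sketch of that idea (the dataset below and its names are made up for illustration): the Dataset returns (image, target_dict), and a custom collate_fn keeps the variable-size targets in a list instead of stacking them:

import torch
from torch.utils.data import DataLoader, Dataset

class DummyDetectionDataset(Dataset):
    # Hypothetical dataset: random images with a variable number of boxes.
    def __len__(self):
        return 7

    def __getitem__(self, idx):
        image = torch.rand(3, 416, 416)
        num_boxes = torch.randint(1, 6, (1,)).item()
        target = {
            "boxes": torch.rand(num_boxes, 4),             # [tx, ty, tw, th]
            "labels": torch.randint(0, 11, (num_boxes,)),  # 11 classes
        }
        return image, target

def detection_collate(batch):
    # Stack the images (all 416x416), but keep the targets as a plain list,
    # one dict per image, so differing box counts are no problem.
    images, targets = zip(*batch)
    return torch.stack(images, dim=0), list(targets)

loader = DataLoader(DummyDetectionDataset(), batch_size=2,
                    collate_fn=detection_collate)
images, targets = next(iter(loader))
# images.shape == [2, 3, 416, 416]; targets is a list of 2 dicts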


Thank you. I load the image paths into a list in the dataset and read batch_size images per batch when iterating over the data_loader.
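
Roughly like this (a simplified sketch; the class name and the label lookup are placeholders):

import torch
from PIL import Image
from torch.utils.data import Dataset

class LazyImageDataset(Dataset):
    # Only file paths are stored up front; each image is read
    # from disk on demand in __getitem__.
    def __init__(self, image_paths, labels_by_path, transform=None):
        self.image_paths = list(image_paths)
        self.labels_by_path = labels_by_path  # e.g. parsed from the .csv
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        path = self.image_paths[idx]
        image = Image.open(path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, self.labels_by_path[path]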

I have a doubt about how the variable number of bounding boxes is handled by YOLO, R-CNN, Faster R-CNN, or any other object detection model.
We have a different number of bounding boxes for each image, but the target seems to take all the bounding boxes of an image together as a single sample. How does the model handle this?
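
As far as I understand, YOLO-style training sidesteps this by encoding the boxes onto a fixed-size grid: each ground-truth box is written into the cell responsible for it, so the target tensor has the same shape for every image regardless of how many boxes it contains. A minimal single-anchor sketch (grid size, box format [cx, cy, w, h] normalized to [0, 1], and class count are my assumptions):

import torch

def build_yolo_target(boxes, classes, grid_size=13, num_classes=11):
    # Fixed-shape target [S, S, 4 + 1 + C], regardless of the box count.
    target = torch.zeros(grid_size, grid_size, 5 + num_classes)
    for box, cls in zip(boxes, classes):
        cx, cy, w, h = box  # box center/size, normalized to [0, 1]
        col = min(int(cx * grid_size), grid_size - 1)  # responsible cell
        row = min(int(cy * grid_size), grid_size - 1)
        target[row, col, 0] = cx * grid_size - col  # x offset within cell
        target[row, col, 1] = cy * grid_size - row  # y offset within cell
        target[row, col, 2] = w
        target[row, col, 3] = h
        target[row, col, 4] = 1.0                   # objectness
        target[row, col, 5 + int(cls)] = 1.0        # one-hot class
    return target

# Images with 5 boxes and with 2 boxes yield targets of identical shape:
t1 = build_yolo_target(torch.rand(5, 4), torch.randint(0, 11, (5,)))
t2 = build_yolo_target(torch.rand(2, 4), torch.randint(0, 11, (2,)))
# t1.shape == t2.shape == torch.Size([13, 13, 16])  -> matches 16 = 1 + 4 + 11

Faster R-CNN-style models, by contrast, consume the variable-length box lists directly (e.g. the list-of-dicts format shown above) and match proposals to ground-truth boxes inside the loss computation.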