Multi-label multi-class Faster R-CNN for object detection

I am currently trying to detect objects that have multiple labels. Each of the labels has different and multiple classes. In other words, I want to detect object instances that have multiple attributes assigned to them. For example, in an image with multiple cars, we want to detect each car instance as well as its corresponding attributes such as colour, number of wheels, etc. Currently, I only predict the individual attributes, but this results in ambiguity as to which object the individual attributes belong to.

Your use case sounds like instance segmentation, so Mask R-CNN might be a useful model to check out. This example shows how to visualize some outputs of this model.

Thanks for the answer, I wonder if it is possible to assign multiple labels to the individual masks? Otherwise I would have to estimate the segmentation of the labels themselves and then reassemble them by comparing their regions to estimate the individual objects. Since I have to compare the regions of the individual masks, the performance would probably depend to a large extent on the pixel accuracy of the segments, especially in the case of occlusions.

In models like Mask R-CNN you can define multiple heads, each used for a different purpose. The heads are extensions on top of the backbone, which produces embeddings of the image and of the objects in it.

If you look at the basic implementation of Mask R-CNN in the PyTorch repo, you can see that it is trained on bounding boxes, labels, and masks. After training, it returns all three for the input data.

You can add additional heads for additional purposes.
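As a rough illustration of what such an extra head could look like, here is a minimal sketch of one linear classifier per attribute, applied to per-object feature vectors. The class name `AttributeHeads`, the attribute names, the class counts, and the feature size of 1024 are all assumptions made for the example. Wiring this into the detector's RoI heads (so that it receives the pooled box features and contributes to the loss) is not shown and would need a custom subclass or a modified forward pass.

```python
from typing import Dict

import torch
from torch import nn


class AttributeHeads(nn.Module):
    """Hypothetical extra head: one linear classifier per attribute."""

    def __init__(self, in_features: int, classes_per_attribute: Dict[str, int]):
        super().__init__()
        self.heads = nn.ModuleDict(
            {name: nn.Linear(in_features, n) for name, n in classes_per_attribute.items()}
        )

    def forward(self, roi_features: torch.Tensor) -> Dict[str, torch.Tensor]:
        # roi_features has shape (num_objects, in_features): one row per detected object,
        # so every attribute prediction stays tied to its object by row index.
        return {name: head(roi_features) for name, head in self.heads.items()}


# Illustrative sizes only; they are not taken from the thread.
heads = AttributeHeads(1024, {"colour": 8, "wheel_count": 4})
roi_features = torch.randn(5, 1024)  # stand-in for the features of 5 detected cars
logits = heads(roi_features)
predicted = {name: out.argmax(dim=1) for name, out in logits.items()}
```

During training, each attribute would typically get its own cross-entropy loss, summed with the existing detection losses.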

Hello again, and thank you for your answer! Unfortunately, I still haven’t come up with a suitable solution. I had also considered using several heads, but the problem is that I have a variable number of objects in the image. The structure of my problem is as follows:

  • The image consists of a variable number of cars.
  • For each car, I want to extract the following information: car number, car colour, car length, car wall, car roof, wheel_count, load_obj1, load_obj2, load_obj3.
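To make the target structure concrete, here is a small sketch of one record per car, with one list of records per image. The class name `CarAnnotation`, the `box` field, and the encoding of every attribute as an integer class index are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CarAnnotation:
    """Hypothetical per-car record; each attribute is stored as a class index."""

    box: Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) layout
    number: int
    colour: int
    length: int
    wall: int
    roof: int
    wheel_count: int
    load_obj1: int
    load_obj2: int
    load_obj3: int


# An image is then a variable-length list of such records (values are made up).
image_annotation: List[CarAnnotation] = [
    CarAnnotation((10.0, 20.0, 110.0, 80.0), 0, 2, 1, 0, 3, 1, 4, 0, 0),
    CarAnnotation((120.0, 22.0, 200.0, 78.0), 1, 5, 0, 1, 0, 0, 2, 2, 0),
]
```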

In my most recent approach I implemented a kind of template matching to check whether the individual attribute masks are contained in one of the whole-car masks. I calculate similarity values between each attribute mask and each vehicle mask and then assign attributes to cars based on these values. The similarity scores are calculated as follows:

def get_similarity_score(mask, whole_car_mask):
    # determine to which degree mask is included in whole car mask:
    # sum all values of mask where mask is smaller than or equal to whole car mask,
    # plus all values of whole car mask where mask is higher than whole car mask
    similarity = mask[mask <= whole_car_mask].sum() + whole_car_mask[mask > whole_car_mask].sum()
    # normalise by the total mass of mask
    similarity = similarity / mask.sum()
    return similarity
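The score above is the sum of the element-wise minimum of the two masks divided by the sum of the attribute mask, i.e. the fraction of the attribute mask that lies inside the car mask. A minimal sketch of the pairwise matching step built on that formulation follows, assuming NumPy arrays with values in [0, 1]. The names `containment` and `assign_to_cars` and the `min_score` threshold of 0.5 are hypothetical, and at least one car mask is assumed to be present.

```python
import numpy as np


def containment(mask: np.ndarray, whole_car_mask: np.ndarray) -> float:
    """Fraction of `mask` that lies inside `whole_car_mask`."""
    total = mask.sum()
    if total == 0:
        return 0.0  # an empty attribute mask matches nothing
    return float(np.minimum(mask, whole_car_mask).sum() / total)


def assign_to_cars(attribute_masks, car_masks, min_score=0.5):
    """For each attribute mask, return the index of the best car, or -1 if below min_score."""
    scores = np.array([[containment(a, c) for c in car_masks] for a in attribute_masks])
    best = scores.argmax(axis=1)
    best[scores.max(axis=1) < min_score] = -1
    return best, scores


# Toy example: two cars side by side, a roof on the left car, a wheel on the right car.
car_left = np.zeros((2, 4)); car_left[:, :2] = 1
car_right = np.zeros((2, 4)); car_right[:, 2:] = 1
roof = np.zeros((2, 4)); roof[0, :2] = 1
wheel = np.zeros((2, 4)); wheel[1, 3] = 1

best, scores = assign_to_cars([roof, wheel], [car_left, car_right])
```

Here `best` holds one car index per attribute mask, and attribute masks that do not clearly fall inside any car are marked with -1 rather than being forced onto the nearest one.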

However, this approach does not perform well and is very error-prone. My model is currently defined as follows:

from torchvision import models
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

weights = models.detection.MaskRCNN_ResNet50_FPN_V2_Weights.DEFAULT
model = models.detection.maskrcnn_resnet50_fpn_v2(weights=weights, box_score_thresh=0.9)
# for predicting masks
in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
hidden_layer = 256
# define a new head for the detector with the required number of classes: 22 for the label-specific classes and 20 as the upper bound for the number of cars which can be present in a scene
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask, hidden_layer, 22 + 20)
# for predicting boxes
# get the number of input features
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, 22 + 20)