Confusion about Mask R-CNN output

Hello,
I am using the PyTorch implementation of Mask R-CNN, following the object detection finetuning tutorial. I am trying to finetune it to perform instance segmentation on images of nanoparticles (256x256x1). There are only two classes: background + nanoparticle.
The model is performing horrendously: validation mAP for ‘bbox’ is around 0.1, and mAP for ‘segm’ is around 0.06. Right now I’m trying to pinpoint exactly why it is performing so badly. After running the evaluate method (from their GitHub) I noticed that the “masks” output always contains exactly 100 masks, even though some of my images have over 4000 masks. Why is this value of 100 hard-coded, and could it be the culprit behind the bad performance?
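For context, the 100 is not hard-coded in the predictor itself; it is the default of the box_detections_per_img constructor argument, which torchvision stores on the RoI heads. A quick way to confirm the cap on a constructed model (a minimal sketch; the attribute name comes from torchvision’s RoIHeads and may differ between versions):

# Inspect the per-image detection cap backing `box_detections_per_img`
# (defaults to 100 in torchvision's detection models).
print(model.roi_heads.detections_per_img)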

Below is how I create the model. I configured the min_size and max_size parameters because my GPU (an RTX 3060) was quickly running out of memory with the default values (800, 1333).

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

# Load a COCO-pretrained Mask R-CNN; shrink the input resolution to fit GPU memory.
model = torchvision.models.detection.maskrcnn_resnet50_fpn_v2(
    weights='MaskRCNN_ResNet50_FPN_V2_Weights.COCO_V1',
    min_size=(256,),
    max_size=256,
    trainable_backbone_layers=2)

# Replace the box predictor with a new one for 2 classes (background + nanoparticle).
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)

# Replace the mask predictor with a new one.
in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
hidden_layer = 256
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask,
                                                   hidden_layer, 2)

device = 'cuda'
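As a sanity check on the modified heads, here is a minimal smoke test (a sketch: the dummy tensor stands in for a real image; since the backbone is COCO-pretrained, 1-channel data is usually repeated to 3 channels first, e.g. with img.repeat(3, 1, 1)):

import torch

model.to(device).eval()
with torch.no_grad():
    dummy = [torch.rand(3, 256, 256, device=device)]  # one fake 3-channel image
    out = model(dummy)
print(out[0].keys())          # boxes, labels, scores, masks
print(out[0]['masks'].shape)  # at most detections_per_img masks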

Thank you in advance.

Hi Dom!

I am not knowledgeable about Mask R-CNN, but my gut reaction is that
it is not well suited to detecting that many objects in an image.

If your objects don’t normally touch / overlap, I would suggest using
semantic segmentation (e.g., U-Net) followed by some post-processing
(e.g., connected components) to identify the instances.
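A minimal sketch of that post-processing step, assuming a trained semantic-segmentation network whose output is a per-pixel foreground probability map (the array names and the 0.5 threshold here are purely illustrative):

import numpy as np
from scipy import ndimage

prob_map = np.random.rand(256, 256)   # stand-in for the network's foreground probabilities
binary_mask = prob_map > 0.5          # threshold into foreground / background

# Connected-component labelling: each isolated blob gets its own integer id,
# which serves as the instance label.
labeled, num_instances = ndimage.label(binary_mask)

# One boolean mask per instance, if needed downstream.
instance_masks = [labeled == i for i in range(1, num_instances + 1)]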

Also, depending on the details of your use case, you might look into
using something like StarDist.

Best.

K. Frank


Thank you for the information, I’ll look into this if Mask R-CNN fails me.

I managed to increase the number of detections possible in an image by passing the parameter box_detections_per_img=4680 to the model, like so:

model = torchvision.models.detection.maskrcnn_resnet50_fpn_v2(
    weights='MaskRCNN_ResNet50_FPN_V2_Weights.COCO_V1',
    min_size=(256,),
    max_size=256,
    trainable_backbone_layers=3,
    box_detections_per_img=4680)
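Note that box_detections_per_img only raises the final per-image cap; the region proposal network still uses pre-/post-NMS proposal counts tuned for COCO-scale instance counts (1000-2000 by default). A sketch of raising those as well, via kwargs that the constructor forwards to MaskRCNN (the values are illustrative guesses, not tuned):

# With ~4000 instances per image, the default 1000-2000 post-NMS proposals
# can become the bottleneck before box_detections_per_img ever matters.
model = torchvision.models.detection.maskrcnn_resnet50_fpn_v2(
    weights='MaskRCNN_ResNet50_FPN_V2_Weights.COCO_V1',
    min_size=(256,),
    max_size=256,
    trainable_backbone_layers=3,
    box_detections_per_img=4680,
    rpn_pre_nms_top_n_train=6000, rpn_pre_nms_top_n_test=6000,
    rpn_post_nms_top_n_train=6000, rpn_post_nms_top_n_test=6000)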

The low-mAP problem still persisted even after these changes, though. Here’s my last train epoch output before evaluation:

eta: 0:01:38  lr: 0.000060  loss: 0.5324 (0.5478)  loss_classifier: 0.0323 (0.0407)  loss_box_reg: 0.0924 (0.0906)  loss_mask: 0.2363 (0.2463)  loss_objectness: 0.0285 (0.0281)  loss_rpn_box_reg: 0.1232 (0.1420)  time: 0.4965  data: 0.1733  max mem: 6582

And the validation evaluation results:

Accumulating evaluation results...
DONE (t=0.64s).
Accumulating evaluation results...
DONE (t=0.62s).
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.105
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.148
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.134
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.105
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.001
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.012
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.108
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.108
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
IoU metric: segm
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.062
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.137
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.042
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.062
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.001
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.009
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.069
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.069
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000

I even evaluated the train set and got nearly identical results to those on the validation data. Does anyone know what the problem might be? It’s as if the model stops learning at 0.105 mAP for IoU 0.50:0.95.
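One confound visible in the tables above: the COCO evaluator itself reports AP/AR at maxDets=100, so with thousands of ground-truth instances per image the reported numbers are capped no matter how good the model is. A sketch of raising the evaluator’s limit, assuming pycocotools’ COCOeval (coco_gt and coco_dt stand for your ground-truth and detection COCO objects):

from pycocotools.cocoeval import COCOeval

# Report AP/AR with a per-image detection limit matching the data
# (4680 here mirrors box_detections_per_img; purely illustrative).
coco_eval = COCOeval(coco_gt, coco_dt, iouType='segm')
coco_eval.params.maxDets = [1, 10, 4680]
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()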