I was trying to train my own instance segmentation model following the official torchvision object detection tutorial. The dataset I used is fairly simple: it contains only 1 category, the image size is always 512x512, and the objects are quite small (most have a width or height of 10–50 pixels). I used Mask R-CNN and followed the tutorial exactly.
The object detection part was working great! However, I found that the mask loss drops very close to 0 quickly (within merely a few iterations; I have ~100 iterations per epoch). I dug in a little and found that the model constantly labels all pixels with “0”, which I believe means “background”. I did some parameter fine-tuning (changing the anchor sizes, etc.), but it didn’t work as I expected.
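To make the “all pixels labelled 0” observation concrete, here is the kind of quick check I used. The helper `foreground_fraction` is my own sketch (not a torchvision API); it assumes the mask predictor outputs per-class logits of shape (N, num_classes, H, W) and measures how many pixels would be thresholded as foreground (class 1):

```python
import torch

def foreground_fraction(mask_logits: torch.Tensor, thresh: float = 0.5) -> float:
    """Fraction of pixels predicted as foreground (class 1).

    mask_logits: (N, C, H, W) raw outputs of the mask predictor.
    """
    probs = mask_logits.sigmoid()
    return (probs[:, 1] > thresh).float().mean().item()

# Synthetic example: logits that strongly favour background everywhere,
# mimicking what I observe from my trained model.
logits = torch.full((4, 2, 28, 28), -5.0)
print(foreground_fraction(logits))  # 0.0 -> every pixel predicted as background
```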
At first I thought something was wrong with my dataset or labels, but I tested the same dataset on Colab using detectron2 and it works perfectly for both object detection and instance segmentation. However, for several reasons I still need to build on top of raw torchvision.
I finally ended up in torchvision/models/detection/roi_heads.py and checked how project_masks_on_boxes works, and found that the returned mask-target tensor is basically filled with 0s: out of 50 proposals in total, only 1 or 2 have any non-zero values. This confused me a lot. From my understanding, RoI Align should come up with some solid values instead of all 0s. Has anybody met this mask prediction problem before, and what possible solution or improvement should I implement? Any ideas would be appreciated.
FYI, I’m using Python 3.8, torch 1.8.2, and torchvision 0.9.2. I’ll put the core part of how I initialize my model below; if any other details of the code would help, let me know and I’ll post them. Thanks!
```python
import torchvision
from torchvision.models.detection.anchor_utils import AnchorGenerator
from torchvision.models.detection.mask_rcnn import MaskRCNNHeads, MaskRCNNPredictor


def build_model():
    mask_roi_pool = torchvision.ops.MultiScaleRoIAlign(
        featmap_names=["0", "1", "2", "3"],
        output_size=14,
        sampling_ratio=0,
    )
    box_roi_pool = torchvision.ops.MultiScaleRoIAlign(
        featmap_names=["0", "1", "2", "3"],
        output_size=7,
        sampling_ratio=0,
    )
    anchor_generator = AnchorGenerator(
        sizes=((8,), (16,), (32,), (64,), (128,)),
        aspect_ratios=((0.5, 1.0, 2.0),) * 5,
    )

    # load an instance segmentation model pre-trained on COCO
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(
        pretrained=True,
        pretrained_backbone=True,
        trainable_backbone_layers=3,
        rpn_nms_thresh=0.1,
        rpn_fg_iou_thresh=0.7,
        rpn_bg_iou_thresh=0.3,
        rpn_score_thresh=0.5,
        box_nms_thresh=0.1,
        box_fg_iou_thresh=0.7,
        box_bg_iou_thresh=0.3,
        box_score_thresh=0.25,
        box_detections_per_img=200,
        box_positive_fraction=0.25,
        bbox_reg_weights=[10.0, 10.0, 5.0, 5.0],
    )

    mask_head = MaskRCNNHeads(
        in_channels=256, layers=(256, 256, 256, 256), dilation=1
    )
    mask_predictor = MaskRCNNPredictor(
        in_channels=model.roi_heads.mask_predictor.conv5_mask.in_channels,
        dim_reduced=256,
        num_classes=2,
    )

    # the anchor generator lives on the RPN, not on roi_heads
    model.rpn.anchor_generator = anchor_generator
    model.roi_heads.box_roi_pool = box_roi_pool
    model.roi_heads.mask_roi_pool = mask_roi_pool
    model.roi_heads.mask_head = mask_head
    model.roi_heads.mask_predictor = mask_predictor
    return model
```

(Note: in my original code I had assigned the anchor generator to `model.roi_heads.anchor_generator`, which is a no-op; the attribute actually lives on `model.rpn`. I also created `box_roi_pool` but never attached it. Both are fixed above.)