Questions about Mask RCNN

Hi, I want to use torchvision.models.detection.maskrcnn_resnet50_fpn

I have a use case of two classes (and background), so I’m invoking it with arguments pretrained=False, num_classes=3, pretrained_backbone=True.

My questions are:

  1. If the backbone is pretrained, does this mean I need to transform my input images to have the same distribution as the images that the original model was trained on?
    1.1 If so, how do I do that? What are the means and stds? Do I need to transform the data before forwarding, during dataset creation (as with the PyTorch ResNet FCN model for semantic segmentation, pretrained on ImageNet, with mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]), or do I need to pass the arguments image_mean and image_std to the MaskRCNN class?

  2. From the docs,

  • boxes ( FloatTensor[N, 4] ): the ground-truth boxes in [x1, y1, x2, y2] format, with values between 0 and H and 0 and W

Shouldn’t it be H - 1 and W - 1?

  3. From the docs,
  • masks ( UInt8Tensor[N, H, W] ): the segmentation binary masks for each instance

By “binary mask,” do you mean a matrix of zeros, with ones where the mask of the instance exists? If so, why do we also need to pass boxes? Maybe I misunderstood, but can’t the boxes be calculated from the minimal and maximal coordinates of the ones, per dimension? Also, wouldn’t a Boolean tensor be safer?


Hi @M_S,

  1. The backbone is a ResNet50 pretrained on ImageNet, and the samples need to be normalized as you said. You should transform the data as it is done here.
  2. Yes, [0, H) and [0, W).
  3. There are N masks, each the size of the image. Binary mask means that the pixel values are 0 everywhere except where the relevant instance is, where they are 1.
    A UInt8Tensor is basically a boolean.
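For what it’s worth, the normalization itself is just a channel-wise (x − mean) / std applied after scaling pixels to [0, 1]; here is a toy pure-Python sketch (no torch), using the ImageNet statistics quoted earlier in the thread:

```python
# Toy sketch (pure Python, no torch) of the normalization the pretrained
# backbone expects: scale pixels to [0, 1] (what ToTensor does), then
# apply (x - mean) / std per channel with the ImageNet statistics.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb):
    """rgb: (r, g, b) integers in [0, 255] -> tuple of normalized floats."""
    return tuple(
        (c / 255.0 - m) / s
        for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )
```

A white pixel (255, 255, 255), for example, maps to roughly (2.25, 2.43, 2.64).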

Hope this answers your questions!


Thanks @spanev

  1. So what are the arguments image_mean, image_std used for?
  2. Why should we pass boxes if we already pass segmentation masks? The boxes can be calculated from the maps. Am I missing something?
  1. Looking at the code again, the transform is already defined in the FasterRCNN class, from which M-RCNN inherits. As you suggested, you only need to provide the right image_mean and image_std when instantiating maskrcnn_resnet50_fpn (image_mean = [0.485, 0.456, 0.406] and image_std = [0.229, 0.224, 0.225]).
    You can see that the transform is declared here and used here.

You can see it when printing the network:

  (transform): GeneralizedRCNNTransform()
  2. We pass masks of size HxW (where H is the height of the image and W is the width of the image). There is no information about the bbox in the mask itself: MaskRCNN applies a RoIAlign to the full mask during the target-sampling phase to get the actual RoI mask.

You say there is no information about the bbox in the mask, but isn’t the bbox essentially defined by the minimal and maximal x and y coordinates that contain a 1 in the mask?

Sure, if you crop a mask w.r.t. its corresponding bbox, you will get the mask.

During training, the RPN proposes RoIs that are not necessarily aligned with the bbox (RoIs have a fixed square size, and we try to match them to the boxes with regression*), but we have to crop the mask of the instance we are learning anyway: hence the full-size HxW masks for each instance.

* Find more info about the bbox regression in the Fast-RCNN paper, Section 2.3 and Figure 1.

I agree completely, but that’s not what I meant.
I mean there is a simple algorithm to calculate a bbox from a segmentation map:
x1 = smallest x that has a 1 in the map
y1 = smallest y that has a 1 in the map
x2 = largest x that has a 1 in the map
y2 = largest y that has a 1 in the map
so I don’t understand why we pass bboxes during training. We don’t need to: they are easily calculated from the maps.
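In code, that bbox-from-mask computation might look like this (a toy sketch with the mask as a nested list of 0/1 values, not torchvision code):

```python
# Toy sketch of the algorithm above: derive [x1, y1, x2, y2] from a
# binary mask given as a nested list of 0/1 values.
def bbox_from_mask(mask):
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:  # empty mask: no box to derive
        return None
    return [min(xs), min(ys), max(xs), max(ys)]
```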

Hi, asking again.
What’s the point of passing bboxes if we already pass masks?

It’s more efficient to calculate them in advance of training. I think Mask R-CNN uses them before predicting the mask.

Best regards



I think this gives the user the option to save the bboxes in files and load them along with the masks, so there is no need to calculate the bboxes at all, which saves some computation time.
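That precompute-and-cache idea might look like this (a hypothetical sketch; the file names and layout are made up, and masks are nested 0/1 lists):

```python
# Hypothetical workflow: compute boxes from masks once, cache to JSON,
# then load them at training time instead of recomputing every epoch.
import json
import os
import tempfile

def mask_to_box(mask):
    # [x1, y1, x2, y2] from a nested-list binary mask (assumed non-empty)
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    return [min(xs), min(ys), max(xs), max(ys)]

# offline: compute once and save alongside the masks
masks = {"img_001": [[0, 1, 1], [0, 1, 0]]}
boxes = {name: mask_to_box(m) for name, m in masks.items()}
path = os.path.join(tempfile.gettempdir(), "boxes.json")
with open(path, "w") as f:
    json.dump(boxes, f)

# training time: load instead of recomputing
with open(path) as f:
    cached_boxes = json.load(f)
```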