Instance segmentation: changing the Mask R-CNN backbone (fine-tuning)

Hi, I’m new to PyTorch and I’m using torchvision.models to practice semantic segmentation and instance segmentation.
I have used Mask R-CNN with a ResNet50 FPN backbone (torchvision.models.detection.maskrcnn_resnet50_fpn) for instance segmentation to find masks in images of cars, and everything works well.

I thought a different backbone might give better results, so I’m trying to replace the Mask R-CNN backbone with a pre-trained MobileNet v2 or ResNeXt, following the instructions in the PyTorch documentation, but the results are bad.

I don’t know whether the models are incompatible (the weights of MobileNet v2/ResNeXt don’t match the Mask R-CNN architecture) or I did something wrong in the implementation.
This is the code I used to instantiate the model and backbone:

# import necessary libraries
import random

import cv2
import matplotlib.pyplot as plt
import numpy as np
import torch
import torchvision
import torchvision.transforms as T
from PIL import Image

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=False)

from torchvision.models.detection import MaskRCNN
from torchvision.models.detection.rpn import AnchorGenerator

# load a pre-trained model for classification and return only the features
backbone = torchvision.models.mobilenet_v2(pretrained=True).features

# MaskRCNN needs to know the number of output channels in a backbone. For mobilenet_v2, it's 1280, so we need to add it here
backbone.out_channels = 1280

# let's make the RPN generate 5 x 3 anchors per spatial location, with 5 different sizes and 3 different aspect ratios. We have a Tuple[Tuple[int]] because each feature map could potentially have different sizes and aspect ratios
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))

# let's define what are the feature maps that we will use to perform the region of interest cropping, as well as the size of the crop after rescaling. if your backbone returns a Tensor, featmap_names is expected to be [0]. More generally, the backbone should return an OrderedDict[Tensor], and in featmap_names you can choose which feature maps to use.

roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=[0],

# put the pieces together inside a MaskRCNN model
model = MaskRCNN(backbone,


After this there are functions for the color mask, getting predictions, and instance segmentation …
and this is the code to preprocess and transform the image:

img = Image.open('car1.jpg')

from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.ToTensor(),  # assumed; Normalize needs a tensor, not a PIL image
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = preprocess(img)
input_batch = input_tensor.unsqueeze(0) # create a mini-batch as expected by the model

img = preprocess(img)
pred = model([img])

I hope someone can help me solve this problem,
thanks in advance!
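
A side note on the anchor settings in the question: 5 sizes times 3 aspect ratios gives 15 anchors per spatial location. Here is a rough sketch of how such anchor shapes could be derived, assuming the aspect ratio means height/width and each anchor keeps an area of about size². The helper name `anchor_shapes` is made up for illustration; this is not torchvision's own code.

```python
import math


def anchor_shapes(sizes, aspect_ratios):
    """Return (width, height) pairs, one per (aspect ratio, size) combination."""
    shapes = []
    for ratio in aspect_ratios:
        # Assumed convention: ratio = height / width, area stays size * size.
        h_ratio = math.sqrt(ratio)
        w_ratio = 1.0 / h_ratio
        for size in sizes:
            shapes.append((w_ratio * size, h_ratio * size))
    return shapes


shapes = anchor_shapes((32, 64, 128, 256, 512), (0.5, 1.0, 2.0))
print(len(shapes))  # number of anchors per spatial location
for width, height in shapes:
    print(f"{width:8.2f} x {height:8.2f}")
```

If the objects are mostly wide and low (cars seen from the side, for example), it may be worth checking that these shapes roughly cover them.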

@ser17, I think you are missing this:

roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=['0'],

mask_roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=['0'],

model = MaskRCNN(backbone,

Question: in the tutorial on adding backbones to R-CNN, I do not see any mention of image normalization:
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],std=[0.229, 0.224, 0.225])

Does this occur automatically somewhere in the data loader that was built for the PennFudanDataset or somewhere else? Do I need to add it manually somewhere? If someone can explain where this is, I’d appreciate it.


I think when you build a MaskRCNN object, a GeneralizedRCNNTransform is added as the first block of the model:

  (transform): GeneralizedRCNNTransform(
      Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
      Resize(min_size=(800,), max_size=1333, mode='bilinear')
  )
  (backbone): Sequential(
    (0): Sequential(
      (0): ConvBNActivation(
        (0): Conv2d(3, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (1): FrozenBatchNorm2d(16)
        (2): Hardswish()
      (1): InvertedResidual(
        (block): Sequential(

This block is added by MaskRCNN() itself; the backbone did not have it before being passed to MaskRCNN(). So the normalization happens inside the model, and you do not need to add it manually.