Applying Mask-RCNN to a custom dataset

I played with the MaskRCNN implementation from torchvision to familiarize myself with it, basically following the TorchVision Object Detection Finetuning Tutorial. Using the pretrained COCO model, I can run inference and the results are not bad. Sometimes a table is detected as a book, but those are not the objects I am interested in anyway.
I managed to write training code for my own dataset using the pretrained COCO model, and overcame the CUDA memory issues (I use two environments, one with 2 GB and another with 10 GB) by adjusting image and batch sizes.
With the dataset (20.000 images / 2 classes / 10.000 images each class) I train a pretrained model using the tools from references/detection (batch size 8, epochs 10).

But the results are not very satisfying. The network returns some bboxes, masks and class labels, but they have nothing to do with the examples used during training, nor with anything from COCO. I assume it has to do with my approach or my understanding of how to finetune Mask-RCNN.

My dataset contains 24x40 grayscale images. Each image shows exactly one object/instance, which is rectangular. Therefore I generate bboxes of shape 24x40 and binary masks of the same size. The idea is to find these objects in larger images, let's say 640x512.
I modified the model according to the examples from the tutorials:

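    from torchvision.models.detection import mask_rcnn
    from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
    from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

    def __create_model__(self, num_classes, pretrained=True, **kwargs):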
        self.num_classes = num_classes

        # load an instance segmentation model pre-trained on COCO
        self.model = mask_rcnn.maskrcnn_resnet50_fpn(pretrained=pretrained, **kwargs)

        # get number of input features for the classifier
        in_features = self.model.roi_heads.box_predictor.cls_score.in_features
        # replace the pre-trained head with a new one
        self.model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

        # now get the number of input features for the mask classifier
        in_features_mask = self.model.roi_heads.mask_predictor.conv5_mask.in_channels
        hidden_layer = 256
        # and replace the mask predictor with a new one
        self.model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask,
                                                                hidden_layer,
                                                                num_classes)

        return self.model

I learned that both training and the pretrained model use mean/std normalization, so I applied it during inference as well. Then I disabled the normalization by passing neutral values to MaskRCNN (mean=0, std=1) and started training a model from scratch, without pretrained weights. Now I am facing the loss = NaN problem again.
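
Since the NaN loss comes up again here, one thing that may be worth ruling out first is an invalid target box (zero width or height, non-finite values, or coordinates outside the image). The following is only a minimal sketch in plain Python; `find_bad_boxes` is a hypothetical helper, not part of torchvision, and it assumes boxes are given as (x1, y1, x2, y2) in pixels.

```python
import math

def find_bad_boxes(boxes, image_width, image_height):
    """Return indices of boxes that are non-finite, degenerate, or out of bounds.

    `boxes` is a list of (x1, y1, x2, y2) tuples in pixel coordinates.
    """
    bad = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        finite = all(math.isfinite(v) for v in (x1, y1, x2, y2))
        positive_area = finite and x2 > x1 and y2 > y1
        in_bounds = (finite and x1 >= 0 and y1 >= 0
                     and x2 <= image_width and y2 <= image_height)
        if not (positive_area and in_bounds):
            bad.append(i)
    return bad

boxes = [
    (0, 0, 24, 40),            # covers a full 24x40 image: valid
    (10, 10, 10, 30),          # zero width: degenerate
    (5, 5, 30, 20),            # exceeds a 24-pixel-wide image
    (0, 0, float("nan"), 40),  # non-finite coordinate
]
print(find_bad_boxes(boxes, image_width=24, image_height=40))
```

Running such a check once over the whole dataset before training is cheap. If it reports nothing, the cause is more likely elsewhere (learning rate, normalization, or the anchor setup).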

  • Is the approach OK? None of the other examples I saw used bboxes and masks covering the whole image for training.
  • Did I miss something when I adapted the model to 3 classes (background, positive, negative), as COCO has 81 classes?
  • Are examples of size 24x40 too small for the network to learn representative features, or is the network too deep for the input?
  • As the dataset is what it is, might it help to paste each 24x40 image onto a larger, randomly generated image to get some background?
  • The network is designed for RGB images. What do I need to consider when using grayscale images? Is converting each grayscale image to RGB sufficient?
  • As the loss seems to explode, might an optimizer other than SGD be a better choice?

  1. I don't think the approach is OK. You should pass the full image with the bbox annotated, but you are only passing extremely small images where the whole image is the bounding box. I think 24x40 is very small.

  2. Your class indices have to start at 1, leaving 0 for background. Not sure whether you do this or not.

I think your bigger problem is that the objects you want to detect are very small, plus you provide only the bbox as the image (I'm not sure about this, but it doesn't sound right to me).

Only my opinion, and I'm not an expert.

Unfortunately, I only have those 24x40 images. I can resize them, but then the quality suffers. The images are cropped and represent exactly the objects I want to identify in other, bigger images.

Do you mean the index for the class labels? I have three class labels: [0, 1, 2].
The dataset itself starts with an idx of 0 and goes up to 19999.
I use a DataLoader, which iterates from 0 … 19999. Iterating over the Dataset directly doesn't stop there, as that never calls __len__, so it runs past the last index and tries to read an entry with idx=20000.

I am already working on integrating the 24x40 images into bigger, randomly generated images. Maybe this helps.
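
The idea of embedding the patches into bigger images could be sketched roughly as follows. This is only an illustration under assumptions: `paste_on_background` is a hypothetical helper, the patches are taken to be 24 pixels wide and 40 high, uniform noise stands in for whatever background is actually appropriate, and the gray channel is simply repeated three times to get an RGB-shaped input. The target keys (`boxes`, `labels`, `masks`) follow the format used in the torchvision finetuning tutorial.

```python
import numpy as np

def paste_on_background(patch, label, canvas_hw=(512, 640), rng=None):
    """Paste one grayscale patch at a random position on a random-noise canvas.

    Returns the image as float32 (3, H, W) in [0, 1] and a target dict with
    boxes (N, 4) as [x1, y1, x2, y2], labels (N,), and masks (N, H, W).
    """
    rng = np.random.default_rng() if rng is None else rng
    canvas_h, canvas_w = canvas_hw
    patch_h, patch_w = patch.shape

    # Uniform noise is only a placeholder background (assumption).
    canvas = rng.integers(0, 256, size=(canvas_h, canvas_w), dtype=np.uint8)

    # Random top-left corner such that the patch fits completely.
    y1 = int(rng.integers(0, canvas_h - patch_h + 1))
    x1 = int(rng.integers(0, canvas_w - patch_w + 1))
    y2, x2 = y1 + patch_h, x1 + patch_w
    canvas[y1:y2, x1:x2] = patch

    # The object is rectangular, so the mask is the filled box.
    mask = np.zeros((canvas_h, canvas_w), dtype=np.uint8)
    mask[y1:y2, x1:x2] = 1

    # Repeat the single gray channel three times to get an RGB-shaped input.
    image = np.repeat(canvas[None, :, :], 3, axis=0).astype(np.float32) / 255.0

    target = {
        "boxes": np.array([[x1, y1, x2, y2]], dtype=np.float32),
        "labels": np.array([label], dtype=np.int64),
        "masks": mask[None, :, :],
    }
    return image, target

patch = np.full((40, 24), 128, dtype=np.uint8)  # stand-in for one 24x40 example
image, target = paste_on_background(patch, label=1, rng=np.random.default_rng(0))
print(image.shape, target["boxes"], int(target["masks"].sum()))
```

Inside a Dataset, the arrays would still have to be converted to tensors (e.g. with `torch.from_numpy`) before being handed to the model. Pasting several patches per canvas would work the same way, with one row per instance in each target entry.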

I think you will need custom anchors for that small size. I know some YOLO implementations do the calculation to get those, but I don't know how to do it in PyTorch.

With the dataset (20.000 images / 2 classes / 10.000 images each class)

You say two classes, let's say cats and dogs, so cat is id 1 and dog is id 2, but you tell the model 3, as you leave 0 for background.

num_classes (int): number of output classes of the model (including the background).
            If box_predictor is specified, num_classes should be None.

I already had a look at the code, and we can pass a custom AnchorGenerator to MaskRCNN.

        rpn_anchor_generator (AnchorGenerator): module that generates the anchors for a set of feature

From FasterRCNN.__init__:

        if rpn_anchor_generator is None:
            anchor_sizes = ((32,), (64,), (128,), (256,), (512,))
            aspect_ratios = ((0.5, 1.0, 2.0),) * len(anchor_sizes)
            rpn_anchor_generator = AnchorGenerator(anchor_sizes, aspect_ratios)
        if rpn_head is None:
            rpn_head = RPNHead(out_channels, rpn_anchor_generator.num_anchors_per_location()[0])
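
To get a feeling for whether these default sizes suit a 24x40 object, a rough check in plain Python might look like the sketch below. It rests on several assumptions: the helper names are hypothetical, the object is taken to be 24 wide and 40 high, the anchor shape is derived as height = size * sqrt(ratio) and width = size / sqrt(ratio), and the model's default resize (shorter side to 800, longer side capped at 1333) is applied first. Anchors are compared centered on the object, so the numbers are an upper bound on what grid-placed anchors would reach.

```python
import math

def anchor_shapes(sizes=(32, 64, 128, 256, 512), aspect_ratios=(0.5, 1.0, 2.0)):
    """(width, height) of each anchor, with aspect ratio = height / width."""
    shapes = []
    for size in sizes:
        for ratio in aspect_ratios:
            h = size * math.sqrt(ratio)
            w = size / math.sqrt(ratio)
            shapes.append((w, h))
    return shapes

def centered_iou(box_wh, anchor_wh):
    """IoU of two axis-aligned boxes that share the same center."""
    inter = min(box_wh[0], anchor_wh[0]) * min(box_wh[1], anchor_wh[1])
    union = box_wh[0] * box_wh[1] + anchor_wh[0] * anchor_wh[1] - inter
    return inter / union

def best_anchor(box_wh, image_wh, min_size=800, max_size=1333, **anchor_kwargs):
    """Best centered IoU after the resize step (shorter side -> min_size)."""
    scale = min(min_size / min(image_wh), max_size / max(image_wh))
    scaled = (box_wh[0] * scale, box_wh[1] * scale)
    return max((centered_iou(scaled, a), a) for a in anchor_shapes(**anchor_kwargs))

# Object embedded in a 640x512 image, default anchors.
iou_embedded, _ = best_anchor((24, 40), (640, 512))
# The patch itself used as the whole training image, default anchors.
iou_full, _ = best_anchor((24, 40), (24, 40))
# Embedded object again, with a single hypothetical anchor size of 48.
iou_custom, _ = best_anchor((24, 40), (640, 512), sizes=(48,))
print(round(iou_embedded, 2), round(iou_full, 2), round(iou_custom, 2))
```

Under these assumptions, a patch that fills the whole image is upscaled so much that even the largest default anchor covers only a fraction of it, while an anchor size close to the resized object fits the embedded case noticeably better than the defaults. Sizes found this way could then be handed to the model through the `rpn_anchor_generator` argument shown above.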

Yes, in my Dataset implementation I add the class 0. The annotation file only has two classes, 1 and 2.
num_classes is thus 3: [0, 1, 2].
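
As a small sketch of that indexing scheme (the class names are taken from the question above, and `check_labels` is a hypothetical helper): index 0 stays reserved for background, the labels stored in each target only use 1 and 2, and the count passed to the predictor heads is the number of foreground classes plus one.

```python
# Foreground class names as used in the question; only the indexing matters.
CLASS_NAMES = ["positive", "negative"]

# Index 0 is reserved for background, so foreground labels start at 1.
LABEL_TO_ID = {name: i for i, name in enumerate(CLASS_NAMES, start=1)}
NUM_CLASSES = len(CLASS_NAMES) + 1  # value passed as num_classes

def check_labels(labels, num_classes=NUM_CLASSES):
    """Raise if any target label is background (0) or out of range."""
    for label in labels:
        if not 1 <= label < num_classes:
            raise ValueError(f"label {label} not in 1..{num_classes - 1}")
    return True

print(LABEL_TO_ID, NUM_CLASSES)
check_labels([1, 2, 1])
```

A check like this in `__getitem__` would catch an annotated object accidentally labeled 0, which the model would otherwise treat as background.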