I have been experimenting with the Mask R-CNN implementation from torchvision and familiarized myself with it, essentially following the TorchVision Object Detection Finetuning Tutorial. With the pretrained COCO model I can run inference and the results are reasonable. Occasionally a table is classified as a book, but those are not the objects I am interested in anyway.
I managed to write training code for my own dataset, starting from the pretrained COCO model, and worked around the CUDA memory issues (I use two environments, one with 2 GB and one with 10 GB of GPU memory) by adjusting the image and batch sizes.
With this dataset (20,000 images / 2 classes / 10,000 images per class) I fine-tune the pretrained model using the tools from references/detection (batch size 8, 10 epochs), roughly as sketched below.
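For reference, this is approximately how the training is wired up. It is a minimal sketch assuming engine.py and utils.py from references/detection are importable; SmallObjectDataset, image_paths, labels and create_model are placeholders for my own code (the model helper is shown further down):

import torch
from engine import train_one_epoch   # from references/detection
import utils                          # from references/detection

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# "SmallObjectDataset" and "create_model" are placeholders for my own dataset class
# and for the __create_model__ helper shown further down
dataset = SmallObjectDataset(image_paths, labels)
data_loader = torch.utils.data.DataLoader(
    dataset, batch_size=8, shuffle=True, num_workers=4,
    collate_fn=utils.collate_fn)

model = create_model(num_classes=3)   # background, positive, negative
model.to(device)

params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)

for epoch in range(10):
    # train_one_epoch from the tutorial already adds a warm-up scheduler for the first epoch
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=100)
    lr_scheduler.step()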
But the results are not satisfying. The network returns some bboxes, masks and class labels, but they correspond neither to the examples used during training nor to anything from COCO. I assume the problem is with my approach or with my understanding of how to fine-tune Mask R-CNN.
My dataset contains 24x40 grayscale images; each image shows exactly one object/instance, which is rectangular. I therefore generate bboxes of size 24x40 and binary masks of the same size. The idea is to later find these objects in larger images, let's say 640x512.
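My dataset class looks roughly like this. It is a minimal sketch in the target format that the references/detection training code expects; the paths, the label coding (1 = positive, 2 = negative, 0 reserved for background) and the class name are placeholders:

import torch
from torch.utils.data import Dataset
from torchvision.transforms.functional import to_tensor
from PIL import Image

class SmallObjectDataset(Dataset):
    """Minimal sketch: every 24x40 sample is one full-image instance."""

    def __init__(self, image_paths, labels):
        self.image_paths = image_paths   # list of file paths (placeholder)
        self.labels = labels             # e.g. 1 = positive, 2 = negative; 0 is background

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = Image.open(self.image_paths[idx]).convert("RGB")  # grayscale -> 3 channels
        w, h = img.size                                          # 40 x 24 here

        # the object covers the whole image, so box and mask are trivially full-size
        boxes = torch.tensor([[0, 0, w, h]], dtype=torch.float32)
        target = {
            "boxes": boxes,
            "labels": torch.tensor([self.labels[idx]], dtype=torch.int64),
            "masks": torch.ones((1, h, w), dtype=torch.uint8),
            "image_id": torch.tensor([idx]),
            "area": (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]),
            "iscrowd": torch.zeros((1,), dtype=torch.int64),
        }
        return to_tensor(img), target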
I modified the model according to the examples from the tutorials:
# imports used by the snippet
from torchvision.models.detection import mask_rcnn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def __create_model__(self, num_classes, pretrained=True, **kwargs):
    self.num_classes = num_classes
    # load an instance segmentation model pre-trained on COCO
    self.model = mask_rcnn.maskrcnn_resnet50_fpn(pretrained=pretrained, **kwargs)
    # get the number of input features for the classifier
    in_features = self.model.roi_heads.box_predictor.cls_score.in_features
    # replace the pre-trained box head with a new one
    self.model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    # now get the number of input features for the mask classifier
    in_features_mask = self.model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    # and replace the mask predictor with a new one
    self.model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask,
                                                            hidden_layer,
                                                            num_classes)
    return self.model
I learned that training and the pretrained model use mean/std normalization internally, so I applied it during inference as well. Later I disabled this normalization by supplying identity values to MaskRCNN (mean=0, std=1), and I started to train a model from scratch instead of fine-tuning a pretrained one. Now I am facing the loss = NaN problem (again).
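By "supplying the values to MaskRCNN" I mean roughly the following; image_mean and image_std are forwarded by maskrcnn_resnet50_fpn to the MaskRCNN constructor (and the same kwargs go through my __create_model__ above). The identity values are just the ones I used to effectively switch the internal normalization off:

from torchvision.models.detection import maskrcnn_resnet50_fpn

# identity values, so GeneralizedRCNNTransform leaves the pixel values unchanged
model = maskrcnn_resnet50_fpn(
    pretrained=False,               # training from scratch in this experiment
    num_classes=3,
    image_mean=[0.0, 0.0, 0.0],
    image_std=[1.0, 1.0, 1.0],
)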
- Is the approach OK? All other examples I have seen do not create bboxes and masks that cover the whole image for training.
- Did I miss something when adapting the model to 3 classes (background, positive, negative), given that COCO has 81 classes?
- Are examples of size 24x40 too small for the network to learn representative features or is the network too deep for the input?
- As the dataset is what it is, might it help to paste each 24x40 image onto a larger, randomly generated image to get some background? (A rough sketch of what I mean follows after this list.)
- The network is designed for RGB images. What do I need to consider when using grayscale images? Is converting each grayscale image to RGB sufficient?
- As the loss seems to explode, might another optimizer than SGD be a better choice?
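Regarding the background and grayscale questions, this is roughly what I have in mind; the canvas size, the noise background and the value scaling are just assumptions, not something I have validated:

import numpy as np
import torch

def paste_on_background(crop, canvas_hw=(512, 640), rng=np.random):
    """Sketch: place a (24, 40) grayscale crop on a random noise background.

    Returns a 3-channel float tensor in [0, 1] plus the box and mask shifted
    to the paste position, in the format the torchvision detection models expect.
    """
    ch, cw = crop.shape                      # 24, 40
    H, W = canvas_hw
    canvas = rng.randint(0, 256, size=(H, W)).astype(np.float32)

    # random top-left corner so the crop fits completely inside the canvas
    y0 = rng.randint(0, H - ch + 1)
    x0 = rng.randint(0, W - cw + 1)
    canvas[y0:y0 + ch, x0:x0 + cw] = crop

    # grayscale -> RGB by repeating the single channel, scaled to [0, 1]
    img = torch.from_numpy(canvas / 255.0).unsqueeze(0).repeat(3, 1, 1)

    box = torch.tensor([[x0, y0, x0 + cw, y0 + ch]], dtype=torch.float32)
    mask = torch.zeros((1, H, W), dtype=torch.uint8)
    mask[0, y0:y0 + ch, x0:x0 + cw] = 1
    return img, box, mask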