FasterRCNN on COCO with different combinations of ResNet50 backbones

Hello,
I get very different results when training on COCO with what look like almost identical backbones. As a basis I use the code from here. Apart from the network architecture, all training parameters stay the same.

  1. Unmodified maskrcnn_resnet50_fpn model. I get:
    Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.354
    which is consistent with the results reported on Vision’s detection website.

  2. I take a pretrained ResNet50 backbone and wrap it in FasterRCNN:

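import torch.nn as nn
import torchvision
from torchvision.models.detection import FasterRCNN

# ImageNet-pretrained ResNet50; only the convolutional body is kept (no avgpool/fc)
bb_model = torchvision.models.resnet50(pretrained=True)
backbone = nn.Sequential(bb_model.conv1,
                         bb_model.bn1,
                         bb_model.relu,
                         bb_model.maxpool,
                         bb_model.layer1,
                         bb_model.layer2,
                         bb_model.layer3,
                         bb_model.layer4)
# FasterRCNN reads the number of output channels from the backbone
backbone.out_channels = 2048
# num_classes is defined elsewhere in the training script
model = FasterRCNN(backbone, num_classes=num_classes)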

This model has no frozen BatchNorm, its first two layers are not frozen either, and there is no FPN. Still, the result is strikingly different, which I didn’t expect:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.239

  3. For the third version I take the same backbone as in the original training code; the only difference is that the final model has no FPN:
resnet50_fpn = torchvision.models.detection.maskrcnn_resnet50_fpn(num_classes=91)
backbone_fpn = nn.Sequential(
    resnet50_fpn.backbone.body.conv1,
    resnet50_fpn.backbone.body.bn1,
    resnet50_fpn.backbone.body.relu,
    resnet50_fpn.backbone.body.maxpool,
    resnet50_fpn.backbone.body.layer1,
    resnet50_fpn.backbone.body.layer2,
    resnet50_fpn.backbone.body.layer3,
    resnet50_fpn.backbone.body.layer4,
)
backbone_fpn.out_channels = 2048
model = FasterRCNN(backbone_fpn, num_classes=num_classes)

With the same training protocol the result is slightly better, but still not what I expected:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.265

So the question is: why is that? Is something wrong with combining a backbone with FasterRCNN in this manner?