Training retinanet_resnet50_fpn_v2 with varying output sizes and num_classes=2

I’ve been trying to train a RetinaNet on the SKU110K dataset. The first issue I encountered is that the retinanet_resnet50_fpn_v2() function throws the following error when num_classes=2:

model = retinanet_resnet50_fpn_v2(weights='DEFAULT', score_thresh=0.35, num_classes=2).to(device)

ValueError: The parameter 'num_classes' expected value 91 but got 2 instead.

My dataset only has 2 classes (object and background), so why does retinanet_resnet50_fpn_v2 require num_classes=91?

The other issue is that each input image has a differing number of bounding boxes associated with it. My Dataset loads items in the following way:

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        img_name = self.image_names[idx]
        class_name, width, height = self.ds[self.ds.image_name == img_name].iloc[0, 5:8]
        # float dtype so the in-place rescaling below isn't truncated to ints
        boxes = np.array(self.ds[self.ds.image_name == img_name].iloc[:, 1:5], dtype=np.float32)

        image = Image.open(f"{img_name}")

        # Compute the rescale factors from the *original* size, before resizing;
        # float division is needed here (with // on the already-resized image,
        # Fx and Fy would always be 1 and the boxes would never be rescaled)
        Fx = image.width / 640
        Fy = image.height / 640

        image = image.resize((640, 640))

        # Rescale the box coordinates (x1, y1, x2, y2) to the 640x640 image
        boxes[:, 0] = boxes[:, 0] / Fx
        boxes[:, 2] = boxes[:, 2] / Fx
        boxes[:, 1] = boxes[:, 1] / Fy
        boxes[:, 3] = boxes[:, 3] / Fy

        image = self.transform(image) # [T.ToImageTensor(), T.ConvertImageDtype()]

        sample = {"img": image, "boxes": boxes, "label": 1}

        return sample

In this scenario, sample['img'] is a Tensor of shape (3, 640, 640) and sample['boxes'] is an array of shape (N, 4), where N is the number of bounding boxes in the image.
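For what it's worth, torchvision's detection models expect each target as a dict with "boxes" and "labels" tensors, so returning the sample in that shape avoids a conversion step later. A minimal sketch (the key names come from the torchvision detection API; the rest follows this dataset's own layout):

    import torch

    # Inside __getitem__, after computing `image` and `boxes` as above:
    target = {
        "boxes": torch.as_tensor(boxes, dtype=torch.float32),        # (N, 4), xyxy
        "labels": torch.ones((boxes.shape[0],), dtype=torch.int64),  # single foreground class
    }
    return image, target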

Because of this, when I get a batch from the DataLoader, I get

RuntimeError: stack expects each tensor to be equal size, but got [74, 4] at entry 0 and [128, 4] at entry 1

Are there any RetinaNet examples out there that use torchvision's v2 models?

I also tried adding a collate function to the DataLoader that padded the batch with torch.ones tensors, but that throws an AssertionError: All bounding boxes should have positive height and width. Found invalid box error, because those padding boxes have zero width and height.

The pretrained model is trained on 91 classes. If you want to change the classifier to output logits for 2 classes only, initialize the model in its original form, replace the classifier, and finetune it.
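Something along these lines should work as a minimal sketch (the attribute paths follow torchvision's RetinaNet layout; the GroupNorm(32) norm layer mirrors what the v2 head uses, but treat the exact norm choice as an assumption):

    from functools import partial

    import torch
    from torchvision.models.detection import retinanet_resnet50_fpn_v2
    from torchvision.models.detection.retinanet import RetinaNetClassificationHead

    # Load the model with its pretrained COCO weights (91 classes)...
    model = retinanet_resnet50_fpn_v2(weights='DEFAULT', score_thresh=0.35)

    # ...then replace the classification head with one predicting 2 classes
    # and finetune that.
    num_anchors = model.head.classification_head.num_anchors
    model.head.classification_head = RetinaNetClassificationHead(
        in_channels=model.backbone.out_channels,
        num_anchors=num_anchors,
        num_classes=2,
        norm_layer=partial(torch.nn.GroupNorm, 32),
    )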

Could you check if the Object detection finetuning tutorial would work for this model, too?

Try to filter out invalid bounding boxes which are empty.
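For the padding error, a sketch of that filtering in xyxy coordinates (assuming `boxes` is the (N, 4) array from your __getitem__):

    # Keep only boxes with strictly positive width and height
    keep = (boxes[:, 2] > boxes[:, 0]) & (boxes[:, 3] > boxes[:, 1])
    boxes = boxes[keep]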

Thanks for the response @ptrblck, I understand it now.

The pretrained model is trained on 91 classes. If you want to change the classifier to output logits for 2 classes only, initialize the model in its original form, replace the classifier, and finetune it.

I understand the 91-class issue now: I should only be fine-tuning the classification head, rather than the entire backbone. Thanks for your help!

Could you check if the Object detection finetuning tutorial would work for this model, too?

I’ve been using this tutorial; I think I was just having issues with the slightly different attribute names used in RetinaNet compared to FasterRCNN.
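For anyone hitting the same thing, the difference is roughly this (attribute paths as defined in torchvision's model code):

    # FasterRCNN (as in the tutorial): the predictor lives under roi_heads
    model.roi_heads.box_predictor

    # RetinaNet: the equivalent heads live under model.head
    model.head.classification_head
    model.head.regression_head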

Try to filter out invalid bounding boxes which are empty.

If I don’t pad the Dataset output with empty bounding boxes, then I get

RuntimeError: stack expects each tensor to be equal size, but got [150, 4] at entry 0 and [162, 4] at entry 1

How is one supposed to train the model when each input image has a varying number of bounding boxes?

I would again refer to the linked tutorial, since the PennFudanDataset also returns targets with a variable number of bounding boxes:

dataset = PennFudanDataset('PennFudanPed', transforms=None)

for img, target in dataset:
    print(target["boxes"].shape)

torch.Size([2, 4])
torch.Size([1, 4])
torch.Size([1, 4])
torch.Size([2, 4])
torch.Size([2, 4])
torch.Size([2, 4])
torch.Size([3, 4])
torch.Size([2, 4])
...
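The piece that makes this work with a DataLoader is the collate function: torchvision's detection models take a list of images and a list of target dicts, so the batch should stay a tuple of per-sample items instead of being stacked. The tutorial's helper boils down to this sketch (assuming the dataset returns (image_tensor, target_dict) pairs and `model` is the RetinaNet from above):

    from torch.utils.data import DataLoader

    def collate_fn(batch):
        # Keep images and targets as tuples of per-sample items
        # instead of stacking variable-sized tensors
        return tuple(zip(*batch))

    loader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_fn)

    for images, targets in loader:
        # images: tuple of (3, H, W) tensors; targets: tuple of per-image dicts
        loss_dict = model(list(images), list(targets))  # training mode returns losses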

I’m bad at programming