I have the following code to select RoIs:
gt_boxes = [t["boxes"] for t in targets]
images, targets = self.transform(images, targets)
features = self.detection.backbone(
    images.tensors
)
if isinstance(features, torch.Tensor):
    features = OrderedDict([("0", features)])
objects_roi = self.detection.roi_heads.box_roi_pool(
    features, gt_boxes, images.image_sizes
)
I am using Faster R-CNN as my detector. I looked into the layer that applies RoI pooling and found that the MultiScaleRoIAlign class does this. I created the following example to check that the logic of MultiScaleRoIAlign matches my expectations:
from typing import Optional, List, Dict, Tuple, Union
from torchvision.ops.boxes import box_area
from torchvision.ops import roi_align
from torch import nn, Tensor
import torchvision
import torch
from collections import OrderedDict
import matplotlib.pyplot as plt
import numpy as np
import cv2
image_sizes = [(512, 512)]
arr = np.zeros(image_sizes[0])
bboxs = [(100, 10, 170, 190), (250, 120, 370, 250), (100, 250, 130, 270)]
for (x1, y1, x2, y2) in bboxs:
    arr[y1:y2, x1:x2] = np.ones((y2-y1, x2-x1))
arr1 = cv2.resize(arr, dsize=(124, 124), interpolation=cv2.INTER_CUBIC)
arr2 = cv2.resize(arr, dsize=(64, 64), interpolation=cv2.INTER_CUBIC)
arr3 = cv2.resize(arr, dsize=(32, 32), interpolation=cv2.INTER_CUBIC)
arr4 = cv2.resize(arr, dsize=(16, 16), interpolation=cv2.INTER_CUBIC)
m = torchvision.ops.MultiScaleRoIAlign(
    featmap_names=['0', '1', '2', '3'], output_size=7, sampling_ratio=2)
i = OrderedDict()
i['0'] = torch.Tensor(np.expand_dims(
    np.stack([arr1 for _ in range(1)]), axis=0))
i['1'] = torch.Tensor(np.expand_dims(
    np.stack([arr2 for _ in range(1)]), axis=0))
i['2'] = torch.Tensor(np.expand_dims(
    np.stack([arr3 for _ in range(1)]), axis=0))
i['3'] = torch.Tensor(np.expand_dims(
    np.stack([arr4 for _ in range(1)]), axis=0))
boxes = torch.Tensor(bboxs)
# original image size, before computing the feature maps
output = m(i, [boxes], image_sizes)
It seems the issue was in this line of code (source):
levels = mapper(boxes)
It seems that levels holds one entry per bbox, with each bbox mapped to a single feature level, so levels = (1, 1, 0) for the example given. The following code (source), which selects the boxes belonging to each level, therefore pools every bbox from only one level:
idx_in_level = torch.where(levels == level)[0]
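To see why each box ends up on exactly one level, here is a minimal pure-Python sketch of the FPN-style level assignment heuristic. It is not the torchvision code itself: the function name map_box_to_level is hypothetical, and the parameters (canonical scale 224, canonical level 4, levels clamped to 2..5 for feature maps at roughly 1/4 to 1/32 of the image size) are assumptions chosen to match the example above.

```python
import math

def map_box_to_level(box, k_min=2, k_max=5,
                     canonical_scale=224, canonical_level=4, eps=1e-6):
    """Assign a box (x1, y1, x2, y2) to one pyramid level index (hypothetical helper)."""
    x1, y1, x2, y2 = box
    # The "scale" of a box is the square root of its area.
    scale = math.sqrt((x2 - x1) * (y2 - y1))
    # Larger boxes go to coarser (higher) levels, smaller boxes to finer ones.
    target = math.floor(canonical_level + math.log2(scale / canonical_scale) + eps)
    # Clamp to the available levels (assumed 2..5), then shift to a 0-based index.
    target = min(max(target, k_min), k_max)
    return target - k_min

bboxs = [(100, 10, 170, 190), (250, 120, 370, 250), (100, 250, 130, 270)]
levels = [map_box_to_level(b) for b in bboxs]
print(levels)  # [1, 1, 0]
```

Under these assumptions the two larger boxes land on level index 1 and the small box on level index 0, which is consistent with levels = (1, 1, 0) above.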
I was expecting output to have shape (num_bboxes x num_features, 1, output_size, output_size); however, what I am getting is (num_bboxes, 1, output_size, output_size).