MultiScaleRoIAlign logic

I have the following code to select RoIs:

images, targets = self.transform(images, targets)
# Read the boxes after the transform so they are in the resized image's coordinates
gt_boxes = [t["boxes"] for t in targets]
features = self.detection.backbone(images.tensors)
if isinstance(features, torch.Tensor):
    features = OrderedDict([("0", features)])
objects_roi = self.detection.roi_heads.box_roi_pool(
    features, gt_boxes, images.image_sizes
)

I am using Faster R-CNN as my detector. While looking into the layer that applies RoI pooling, I found that the MultiScaleRoIAlign class does this. I created the following example to check that the logic of MultiScaleRoIAlign matches my expectation:

from collections import OrderedDict

import cv2
import numpy as np
import torch
import torchvision

# Blank 512 x 512 image with three filled boxes given as (x1, y1, x2, y2)
image_sizes = [(512, 512)]
arr = np.zeros(image_sizes[0])
bboxs = [(100, 10, 170, 190), (250, 120, 370, 250), (100, 250, 130, 270)]
for (x1, y1, x2, y2) in bboxs:
    arr[y1:y2, x1:x2] = 1
# Downsample the image to stand in for four feature maps of decreasing resolution
arr1 = cv2.resize(arr, dsize=(124, 124), interpolation=cv2.INTER_CUBIC)
arr2 = cv2.resize(arr, dsize=(64, 64), interpolation=cv2.INTER_CUBIC)
arr3 = cv2.resize(arr, dsize=(32, 32), interpolation=cv2.INTER_CUBIC)
arr4 = cv2.resize(arr, dsize=(16, 16), interpolation=cv2.INTER_CUBIC)

m = torchvision.ops.MultiScaleRoIAlign(
    featmap_names=['0', '1', '2', '3'], output_size=7, sampling_ratio=2)
# Each feature map becomes a float tensor of shape (batch=1, channels=1, H, W)
i = OrderedDict()
i['0'] = torch.from_numpy(arr1).float()[None, None]
i['1'] = torch.from_numpy(arr2).float()[None, None]
i['2'] = torch.from_numpy(arr3).float()[None, None]
i['3'] = torch.from_numpy(arr4).float()[None, None]
boxes = torch.tensor(bboxs, dtype=torch.float32)
# image_sizes is the original image size, before computing the feature maps
output = m(i, [boxes], image_sizes)
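
As a side check on the inputs: as far as I can tell, MultiScaleRoIAlign is not given the scales explicitly but infers one per feature map from its size relative to image_sizes, rounded to the nearest power of two. A minimal sketch of that rounding in plain Python (infer_scale is a hypothetical helper name of mine, not a torchvision function):

```python
import math

def infer_scale(feature_size, image_size):
    # Ratio of feature-map size to image size, snapped to the nearest power of two
    return 2.0 ** round(math.log2(feature_size / image_size))

image_size = 512
feature_sizes = [124, 64, 32, 16]  # sizes of the four feature maps above
scales = [infer_scale(s, image_size) for s in feature_sizes]
print(scales)  # [0.25, 0.125, 0.0625, 0.03125]
```

Under that assumption the 124 x 124 map is treated as stride 4, the same as a 128 x 128 map would be.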

The issue seems to come from this line of the torchvision source:

levels = mapper(boxes)
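
For reference, this is how I understand the mapper: it applies Eq. 1 of the FPN paper, k = floor(k0 + log2(sqrt(w * h) / 224)), clamps k to the available levels, and returns an index relative to the finest level. A minimal sketch in plain Python, assuming k0 = 4, a canonical size of 224, and levels 2 to 5 for the four feature maps above (map_box_to_level is a hypothetical name, not torchvision's API):

```python
import math

def map_box_to_level(box, k_min=2, k_max=5,
                     canonical_scale=224, canonical_level=4, eps=1e-6):
    # box is (x1, y1, x2, y2) in image coordinates
    x1, y1, x2, y2 = box
    s = math.sqrt((x2 - x1) * (y2 - y1))  # sqrt of the box area
    k = math.floor(canonical_level + math.log2(s / canonical_scale) + eps)
    k = min(max(k, k_min), k_max)         # clamp to the available levels
    return k - k_min                      # index into the list of feature maps

bboxs = [(100, 10, 170, 190), (250, 120, 370, 250), (100, 250, 130, 270)]
levels = [map_box_to_level(b) for b in bboxs]
print(levels)  # [1, 1, 0]
```

Each box gets exactly one level index; the small third box falls below the finest level and is clamped to index 0.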

It seems that levels holds one entry per bbox, mapping each box to a single level, so levels = (1, 1, 0) for the example given. The following line, which selects the boxes for a given level, therefore picks each box at exactly one level:

idx_in_level = torch.where(levels == level)[0]

I was expecting the output to have shape (num_bboxes x num_features, 1, output_size, output_size); however, what I am getting is (num_bboxes, 1, output_size, output_size).
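
The shape I observed is consistent with that selection step. Here is a minimal sketch of the loop as I understand it, in plain Python, with shape tuples standing in for the pooled tensors (the stand-ins and every name other than levels and idx_in_level are my own):

```python
levels = [1, 1, 0]   # one level index per box, as in the example
num_levels = 4
output_size = 7

# One slot per box; each slot is filled exactly once
result = [None] * len(levels)
for level in range(num_levels):
    # Indices of the boxes assigned to this level
    idx_in_level = [i for i, lvl in enumerate(levels) if lvl == level]
    for i in idx_in_level:
        # The real code would call roi_align on feature map `level` here
        result[i] = (level, (1, output_size, output_size))

output_shape = (len(result), 1, output_size, output_size)
print(output_shape)                # (3, 1, 7, 7)
print([lvl for lvl, _ in result])  # [1, 1, 0]
```

Because every box is pooled from a single level, the first dimension is num_bboxes rather than num_bboxes x num_features.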

After reading the FPN paper [1] again, I realized that I had something mixed up. My conclusion is that MultiScaleRoIAlign maps each RoI to a single feature level and pools it from that level only. In my case, Faster R-CNN then uses this feature pyramid as if it were an image pyramid when computing the head.

I had mixed up how FPN is used in the Faster R-CNN head and in the RPN; refer to the paper for the details.