Understanding ROI pooling in Fast R-CNN

basingse · August 20, 2022, 7:53am

Hello all,

I am having trouble understanding the ROI pooling layer implemented in torchvision as torchvision.ops.roi_pool. Here is a minimal example:

import numpy as np
import torch
import torchvision

feature_map = np.array([
    [0.70, 0.41, 0.38, 1.23, 0.24],
    [0.14, 0.45, 0.31, 0.73, 3.22],
    [0.11, 0.41, 0.79, 0.69, 0.44],
    [1.47, 0.25, 0.09, 0.32, 2.98],
    [0.48, 0.87, 0.77, 0.26, 0.11],
])


feature_map = torch.tensor(feature_map, requires_grad=False).float()

# (batch, channel, h, w) -> (1, 1, 5, 5)
feature_map = feature_map.unsqueeze(0).unsqueeze(0)

# boxes -> (1, 5); each row is [batch_index, x1, y1, x2, y2]
boxes = np.array([
    [0, 0, 0, 4, 4],
])
boxes = torch.tensor(boxes, requires_grad=False).float()

# ROI pooling with a 3x3 output
pool = torchvision.ops.roi_pool(input=feature_map, boxes=boxes, output_size=3)
print(pool)

The output of the above code is:

tensor([[[[0.7000, 1.2300, 3.2200],
          [1.4700, 0.7900, 3.2200],
          [1.4700, 0.8700, 2.9800]]]])

Here the 5x5 map is divided into sub-regions so that the output is 3x3. A 5x5 region can be split into a 3x3 grid in many ways (I drew only 2 of them).

How are the sub-regions decided in the PyTorch code? I checked the paper and found no detail on this. Is there a specific algorithm that PyTorch follows?
My understanding is that the sub-regions should not overlap, but that does not explain the output above: 3.22 appears at both the (0, 2) and (1, 2) positions of the output, so at least two sub-regions must contain the same input cell.
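
For reference, one bin rule that does reproduce the output above is to treat the box end coordinates as inclusive (so the ROI is 5 cells wide), divide the ROI length by the output size to get a fractional bin size (5 / 3), and then take floor for each bin's start and ceil for its end. That gives overlapping bins [0, 2), [1, 4), [3, 5) along each axis. The sketch below is plain Python and is only my assumption about what the layer might be doing, not something I have confirmed from the torchvision source; the names roi_max_pool and bin_edges are made up for this illustration.

```python
import math

def bin_edges(roi_len, output_size, offset=0, limit=None):
    """Half-open [start, end) index ranges using floor/ceil on fractional bins."""
    bin_size = roi_len / output_size
    edges = []
    for i in range(output_size):
        start = math.floor(i * bin_size) + offset
        end = math.ceil((i + 1) * bin_size) + offset
        if limit is not None:
            # Clip to the feature-map boundary.
            start = min(max(start, 0), limit)
            end = min(max(end, 0), limit)
        edges.append((start, end))
    return edges

def roi_max_pool(feature, box, output_size):
    """Max-pool one ROI of a 2-D list-of-rows feature map.

    box is (x1, y1, x2, y2); end coordinates are assumed to be inclusive.
    Python's round() differs from C's on .5 ties, which does not matter
    for the integer coordinates used here.
    """
    height, width = len(feature), len(feature[0])
    x1, y1, x2, y2 = (round(c) for c in box)
    roi_w = max(x2 - x1 + 1, 1)
    roi_h = max(y2 - y1 + 1, 1)
    rows = bin_edges(roi_h, output_size, offset=y1, limit=height)
    cols = bin_edges(roi_w, output_size, offset=x1, limit=width)
    out = []
    for h0, h1 in rows:
        out_row = []
        for w0, w1 in cols:
            vals = [feature[h][w] for h in range(h0, h1) for w in range(w0, w1)]
            # An empty bin is assumed to produce 0.
            out_row.append(max(vals) if vals else 0.0)
        out.append(out_row)
    return out

feature_map = [
    [0.70, 0.41, 0.38, 1.23, 0.24],
    [0.14, 0.45, 0.31, 0.73, 3.22],
    [0.11, 0.41, 0.79, 0.69, 0.44],
    [1.47, 0.25, 0.09, 0.32, 2.98],
    [0.48, 0.87, 0.77, 0.26, 0.11],
]

edges = bin_edges(5, 3)
pooled = roi_max_pool(feature_map, (0, 0, 4, 4), 3)
print(edges)
for row in pooled:
    print(row)
```

With these bins, the cell holding 3.22 (row 1, column 4) falls inside both the row range [0, 2) and the row range [1, 4), which would account for it showing up twice in the last output column.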

The versions I am using are:

torch = 1.10.2+cpu
torchvision = 0.11.3+cpu

Thanks.