PyTorch RoI pooling implementation deviation

I am trying to implement RoI pooling in PyTorch. Here’s a minimal demo.

import torch
import torch.nn as nn
import torch.nn.functional as F

def roi_pooling(feature_map, rois, size=(7, 7)):
    """
    :param feature_map: (1, C, H, W)
    :param rois: (1, N, 4) N refers to bbox num, 4 represent (ltx, lty, w, h)
    :param size: output size
    :return: (N, C, size[0], size[1])
    """
    output = []
    rois_num = rois.size(1)

    for i in range(rois_num):
        roi = rois[0][i]
        x, y, w, h = roi
        # Pool the cropped RoI region down to the requested output size.
        output.append(F.adaptive_max_pool2d(feature_map[:, :, y:y+h, x:x+w], size))

    # Stack the per-RoI results along the first dimension.
    return torch.cat(output)


if __name__ == '__main__':
    test_tensor = torch.tensor([
        [0.88, 0.44, 0.14, 0.16, 0.37, 0.77, 0.96, 0.27],
        [0.19, 0.45, 0.57, 0.16, 0.63, 0.29, 0.71, 0.70],
        [0.66, 0.26, 0.82, 0.64, 0.54, 0.73, 0.59, 0.26],
        [0.85, 0.34, 0.76, 0.84, 0.29, 0.75, 0.62, 0.25],
        [0.32, 0.74, 0.21, 0.39, 0.34, 0.03, 0.33, 0.48],
        [0.20, 0.14, 0.16, 0.13, 0.73, 0.65, 0.96, 0.32],
        [0.19, 0.69, 0.09, 0.86, 0.88, 0.07, 0.01, 0.48],
        [0.83, 0.24, 0.97, 0.04, 0.24, 0.35, 0.50, 0.91]
    ])
    test_tensor = test_tensor.view(1, 1, 8, 8)
    rois = torch.tensor([[0, 3, 7, 5]])
    rois = rois.view(1, -1, 4)
    output = roi_pooling(test_tensor, rois, (2, 2))

I implemented this by following an example from this website, and the test_tensor and RoI also come from that example. However, according to the website, the output should be
[ [0.85, 0.84], [0.97, 0.96] ]
instead of my demo’s output:
[ [0.85, 0.96], [0.97, 0.96] ]
So what exactly is wrong with my code? Is the way I split the coordinates wrong?


Your code looks fine.
Apparently the kernels of F.adaptive_max_pool2d overlap for odd input sizes:

a = torch.zeros(1, 1, 5, 5)
a[0, 0, 2, 4] = 1.0
F.adaptive_max_pool2d(a, (2, 2))
> tensor([[[[ 0.,  1.],
          [ 0.,  1.]]]])

I’m not sure this is the intended behavior.
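To make the overlap concrete, here is a small sketch of how the pooling windows seem to be chosen along one axis. It assumes that adaptive pooling takes the floor of each window's start and the ceiling of its end; `adaptive_pool_bins` is a made-up helper for illustration, not part of PyTorch.

```python
import math


def adaptive_pool_bins(in_size, out_size):
    """Return the [start, end) index range of each output bin along one axis.

    Assumption: start is rounded down and end is rounded up.
    """
    return [
        (math.floor(i * in_size / out_size),
         math.ceil((i + 1) * in_size / out_size))
        for i in range(out_size)
    ]


print(adaptive_pool_bins(5, 2))  # [(0, 3), (2, 5)]: index 2 falls in both bins
print(adaptive_pool_bins(7, 2))  # [(0, 4), (3, 7)]: index 3 falls in both bins
print(adaptive_pool_bins(6, 2))  # [(0, 3), (3, 6)]: no overlap
```

Under that assumption, the 5 x 7 RoI from the question shares row 2 and column 3 of the crop between neighboring bins, which would explain why 0.96 shows up twice in the output.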

What’s the meaning of “overlap for odd input sizes”? Is the example on that website wrong? And how should I implement it properly?

@ptrblck highlighted your question.
Say you pool a width / height X down to a new width / height Y < X. If Y does not divide X, you’re in a bit of a pickle. Adaptive pooling seems to implicitly “copy” some data in the middle. A less surprising alternative might be to pad the input to a multiple of Y (“same” or “reflection” padding might be more intuitive than zeros), as those will then be part of the same pooling region. Or you could crop the edges off.
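A third option, which is neither padding nor cropping, is to keep the pooling regions disjoint but let them differ in size. Below is a minimal sketch under that assumption. `roi_max_pool_disjoint` is a made-up name, not a PyTorch function, and the floor-based bin edges are only my guess at what the website's example does; I picked them because they reproduce its expected output for the tensor and RoI posted above.

```python
import torch


def roi_max_pool_disjoint(feature_map, roi, size=(2, 2)):
    """Max-pool one RoI with non-overlapping bins of possibly unequal size.

    feature_map: (1, C, H, W); roi: (x, y, w, h) in feature-map cells.
    Assumes w >= size[1] and h >= size[0], so that no bin is empty.
    """
    x, y, w, h = (int(v) for v in roi)
    out_h, out_w = size
    region = feature_map[:, :, y:y + h, x:x + w]
    # Bin i covers [floor(i * n / out), floor((i + 1) * n / out)).
    h_edges = [(i * h) // out_h for i in range(out_h + 1)]
    w_edges = [(j * w) // out_w for j in range(out_w + 1)]
    out = feature_map.new_empty(region.size(0), region.size(1), out_h, out_w)
    for i in range(out_h):
        for j in range(out_w):
            cell = region[:, :,
                          h_edges[i]:h_edges[i + 1],
                          w_edges[j]:w_edges[j + 1]]
            out[:, :, i, j] = cell.amax(dim=(2, 3))
    return out


test_tensor = torch.tensor([
    [0.88, 0.44, 0.14, 0.16, 0.37, 0.77, 0.96, 0.27],
    [0.19, 0.45, 0.57, 0.16, 0.63, 0.29, 0.71, 0.70],
    [0.66, 0.26, 0.82, 0.64, 0.54, 0.73, 0.59, 0.26],
    [0.85, 0.34, 0.76, 0.84, 0.29, 0.75, 0.62, 0.25],
    [0.32, 0.74, 0.21, 0.39, 0.34, 0.03, 0.33, 0.48],
    [0.20, 0.14, 0.16, 0.13, 0.73, 0.65, 0.96, 0.32],
    [0.19, 0.69, 0.09, 0.86, 0.88, 0.07, 0.01, 0.48],
    [0.83, 0.24, 0.97, 0.04, 0.24, 0.35, 0.50, 0.91],
]).view(1, 1, 8, 8)

pooled = roi_max_pool_disjoint(test_tensor, (0, 3, 7, 5), size=(2, 2))
print(pooled)  # values: [[0.85, 0.84], [0.97, 0.96]]
```

With these edges the 5 rows are split 2 + 3 and the 7 columns 3 + 4, so each value belongs to exactly one bin. Whether that matches what Fast R-CNN intends is a separate question.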

Best regards



Which strategy is the correct one for the Fast R-CNN method, the former or the latter?
Also, I’ve looked at the source code of ChainerCV’s implementation of RoI pooling. Which strategy does it use? Here’s the CPU version of the code:

    def forward_cpu(self, inputs):
        self._bottom_data_shape = inputs[0].shape

        bottom_data, bottom_rois = inputs
        channels, height, width = bottom_data.shape[1:]
        n_rois = bottom_rois.shape[0]
        # `numpy.zeros` needs to be used because the arrays can be
        # returned without having some of its values updated.
        top_data = numpy.zeros((n_rois, channels, self.outh, self.outw),
                               dtype=bottom_data.dtype)
        self.argmax_data = numpy.zeros(top_data.shape, numpy.int32)

        for i_roi in six.moves.range(n_rois):
            idx, xmin, ymin, xmax, ymax = bottom_rois[i_roi]
            xmin = int(round(xmin * self.spatial_scale))
            xmax = int(round(xmax * self.spatial_scale))
            ymin = int(round(ymin * self.spatial_scale))
            ymax = int(round(ymax * self.spatial_scale))
            roi_width = max(xmax - xmin + 1, 1)
            roi_height = max(ymax - ymin + 1, 1)
            strideh = 1. * roi_height / self.outh
            stridew = 1. * roi_width / self.outw

            for outh in six.moves.range(self.outh):
                sliceh, lenh = _roi_pooling_slice(
                    outh, strideh, height, ymin)
                if sliceh.stop <= sliceh.start:
                    continue
                for outw in six.moves.range(self.outw):
                    slicew, lenw = _roi_pooling_slice(
                        outw, stridew, width, xmin)
                    if slicew.stop <= slicew.start:
                        continue
                    roi_data = bottom_data[int(idx), :, sliceh, slicew]\
                        .reshape(channels, -1)
                    top_data[i_roi, :, outh, outw] =\
                        numpy.max(roi_data, axis=1)

                    # get the max idx respect to feature_maps coordinates
                    max_idx_slice = numpy.unravel_index(
                        numpy.argmax(roi_data, axis=1), (lenh, lenw))
                    max_idx_slice_h = max_idx_slice[0] + sliceh.start
                    max_idx_slice_w = max_idx_slice[1] + slicew.start
                    max_idx_slice = max_idx_slice_h * width + max_idx_slice_w
                    self.argmax_data[i_roi, :, outh, outw] = max_idx_slice
        return top_data,
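
The helper `_roi_pooling_slice` is not shown above, so which strategy this code uses depends on how it rounds. As a sketch only: if it floors the start and ceils the end of each bin (an assumption on my part, and `roi_pooling_slice_sketch` below is a hypothetical stand-in, not the library's code), the bins overlap in the same way as with F.adaptive_max_pool2d.

```python
import math


def roi_pooling_slice_sketch(out_idx, stride, max_size, roi_offset):
    """Hypothetical stand-in for `_roi_pooling_slice`.

    Assumption: the start is floored and the end is ceiled, then both are
    shifted by the RoI offset and clipped to the feature-map size.
    """
    start = int(math.floor(out_idx * stride)) + roi_offset
    end = int(math.ceil((out_idx + 1) * stride)) + roi_offset
    start = min(max(start, 0), max_size)
    end = min(max(end, 0), max_size)
    return slice(start, end), end - start


# Rows of the RoI from the question: ymin=3, ymax=7 (inclusive) on an 8-row map.
ymin, ymax, outh, height = 3, 7, 2, 8
roi_height = max(ymax - ymin + 1, 1)
strideh = 1. * roi_height / outh
rows = [roi_pooling_slice_sketch(i, strideh, height, ymin) for i in range(outh)]
print(rows)  # [(slice(3, 6, None), 3), (slice(5, 8, None), 3)]
```

Under that assumption row 5 lands in both bins, i.e. the "copy some data in the middle" behavior rather than padding or cropping.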