Does PyTorch handle bilinear sampling for Mask R-CNN?

I know there is torch.nn.functional.grid_sample, but I’m not sure that helps here.


This is how I implement RoIAlign using affine_grid and grid_sample.


Thanks! I have a few questions. First, why are you dividing by 16 in x1 = rois[:, 1::4] / 16.0? Second, I’m conceptually confused by:

 theta[:, 0, 0] = (x2 - x1) / (width - 1)
 theta[:, 0, 2] = (x1 + x2 - width + 1) / (width - 1)
 theta[:, 1, 1] = (y2 - y1) / (height - 1)
 theta[:, 1, 2] = (y1 + y2 - height + 1) / (height - 1)

What is this calculating? I assume it’s x, y, width, and height? And if so, why do you pass it to F.affine_grid? These don’t look like the angles you would use to transform bottom (and what is bottom?).


The input coordinates are in the original image scale, and bottom is the conv feature map before RoIAlign, so the coordinates are divided by 16 (the feature stride) to convert them to the feature-map scale.

Theta is the transformation matrix. x1, y1 is the top-left corner and x2, y2 is the bottom-right corner of the roi.

If you don’t fully understand, you can set some values for x1, y1, x2, y2, and bottom, and run the function.
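To make that concrete, here is a minimal, self-contained sketch of the whole function. The name roi_align_affine, the pre_pool_size default, and the stride-16 spatial_scale are assumptions pieced together from the snippets above, not the exact original code:

```python
import torch
import torch.nn.functional as F

def roi_align_affine(bottom, rois, pre_pool_size=7, spatial_scale=1.0 / 16.0):
    # bottom: (1, C, H, W) conv feature map.
    # rois:   (R, 5) boxes as (batch_idx, x1, y1, x2, y2) in image coordinates.
    height, width = bottom.size(2), bottom.size(3)

    # Image-scale coordinates -> feature-map scale (stride-16 backbone assumed).
    x1 = rois[:, 1] * spatial_scale
    y1 = rois[:, 2] * spatial_scale
    x2 = rois[:, 3] * spatial_scale
    y2 = rois[:, 4] * spatial_scale

    # Per-roi affine matrix that maps the output grid in [-1, 1]^2 onto the
    # roi box, expressed in the input's normalized coordinates.
    theta = torch.zeros(rois.size(0), 2, 3)
    theta[:, 0, 0] = (x2 - x1) / (width - 1)
    theta[:, 0, 2] = (x1 + x2 - width + 1) / (width - 1)
    theta[:, 1, 1] = (y2 - y1) / (height - 1)
    theta[:, 1, 2] = (y1 + y2 - height + 1) / (height - 1)

    grid = F.affine_grid(
        theta,
        torch.Size((rois.size(0), 1, pre_pool_size, pre_pool_size)),
        align_corners=True,  # matches the (width - 1) convention above
    )
    crops = F.grid_sample(
        bottom.expand(rois.size(0), -1, -1, -1), grid, align_corners=True
    )
    return crops
```

An easy sanity check: a roi that spans the whole feature map, sampled at the feature map's own resolution, should come back unchanged.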

@ruotianluo Again, thanks. But my confusion is two-fold: first, I don’t see why the values of the transformation matrix are what they are. Second, I don’t see why the code

grid = F.affine_grid(theta, torch.Size((rois.size(0), 1, pre_pool_size, pre_pool_size)))
crops = F.grid_sample(bottom.expand(rois.size(0), bottom.size(1), bottom.size(2), bottom.size(3)), grid)

would give you what you wanted. You are sampling after applying an affine transformation, so there is no reason to think you are still sampling anywhere near the region of interest.

I must be missing something obvious.

Try this snippet


@ruotianluo Thanks, but it still leaves me confused as to how you are deriving theta. Why is theta defined the way it is? Maybe I’m missing something obvious, but I still don’t see why this implements ROIAlign.

Yes, I understand transformation matrices and affine transformations. But that doesn’t explain where you got the values for theta. To give a concrete example of my confusion, why does theta[:, 0, 0] = (x2 - x1) / (width - 1)? I understand that theta is the transformation matrix, but that doesn’t explain why it has the values you gave it. In other words, why did you choose this particular transformation matrix?

(x2 - x1) / (width - 1) is the scaling term. After the transformation the width is x2 - x1, and before it is width - 1. (The reason it’s width - 1 is that we treat each pixel as sitting on the grid corners.)
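This can be checked numerically without touching PyTorch at all. Under affine_grid's align_corners=True convention, pixel x on a width-W map sits at normalized coordinate 2*x/(W - 1) - 1, and the two theta entries are exactly the scale and translation that send the output edges -1 and +1 to x1 and x2. The values of W, x1, x2 below are made up for illustration:

```python
# Hypothetical feature-map width and roi edges, in feature-map pixels.
W = 14.0
x1, x2 = 3.0, 9.0

a = (x2 - x1) / (W - 1)          # theta[:, 0, 0], the scale term
b = (x1 + x2 - W + 1) / (W - 1)  # theta[:, 0, 2], the translation term

def to_pixel(u, W=W):
    # Invert the align_corners=True normalization: u in [-1, 1] -> pixel index.
    return (u + 1) * (W - 1) / 2

print(to_pixel(a * -1 + b))  # left output edge lands on x1 (3.0, up to float rounding)
print(to_pixel(a * +1 + b))  # right output edge lands on x2 (9.0, up to float rounding)
```

So the matrix isn’t arbitrary: it is the unique axis-aligned affine map taking the output square [-1, 1]^2 onto the roi box in the input’s normalized coordinates.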

I ported crop_and_resize from TensorFlow to PyTorch.
F.grid_sample has to expand the input feature, which costs a lot of memory if we have too many rois.

crops = F.grid_sample(bottom.expand(rois.size(0), bottom.size(1), bottom.size(2), bottom.size(3)), grid)

How can this be ported to quadrilateral rois defined by 8 coordinates, not just axis-aligned bounding boxes?

Would it make sense to use cv2’s perspective transform to calculate theta as input to affine_grid?