How to select pixels of ROI from feature map

In Faster R-CNN or Mask R-CNN the ROI Align layer takes differently sized ROIs as input and projects them onto a uniform output size.
I’m currently implementing this paper, where the authors feed rotated boxes into an inception layer for further detection. It is an end-to-end text recognition pipeline which first does text box detection (the same task as object detection); the text boxes are then fed to an encoder-decoder to recognize the characters.

The outputs of the box detection, i.e. the rotated bounding boxes, score map, and last feature map, have the following shapes:

Feature Map from text detector Shape torch.Size([1, 256, 128, 128])
ROI from text detector Shape torch.Size([1, 5, 128, 128])
Score from text detector Shape torch.Size([1, 2, 128, 128])

The five parameters represent the distances of the current point to the top, bottom, left, and right sides of an associated bounding box, together with its inclined orientation. With these configurations, the detection branch is able to predict a quadrilateral of arbitrary orientation for each text instance.

The original text detector architecture is based on EAST.
I have been able to match its baseline scores.

I’m stuck on the grid sampling part currently.
I first convert the 5 values ({4 distances, 1 angle}) to the corresponding 8 coordinates of a rotated bbox as a quadrilateral (scaled to feature map size, filtered by scores), producing a coordinates tensor of the format:

raw coord coords shape {numberOfPredictionsX8}
torch.Size([2960, 8])
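For reference, the conversion I’m using looks roughly like this (a simplified sketch; `dists_to_quads`, the corner ordering, and variable names are my own, and I’m omitting the score filtering and feature-map scaling here):

```python
import torch

def dists_to_quads(points, geo):
    # points: (N, 2) pixel coords (x, y) of each prediction location
    # geo:    (N, 5) = (top, bottom, left, right, angle) per location
    t, b, l, r, theta = geo.unbind(dim=1)
    cos, sin = torch.cos(theta), torch.sin(theta)
    # axis-aligned corner offsets relative to the point, before rotation
    # corner order: TL, TR, BR, BL
    dx = torch.stack([-l, r, r, -l], dim=1)  # (N, 4)
    dy = torch.stack([-t, -t, b, b], dim=1)  # (N, 4)
    # rotate the offsets by theta around the point
    x = points[:, 0:1] + dx * cos.unsqueeze(1) - dy * sin.unsqueeze(1)
    y = points[:, 1:2] + dx * sin.unsqueeze(1) + dy * cos.unsqueeze(1)
    # interleave into (N, 8) as x0, y0, x1, y1, ...
    return torch.stack([x, y], dim=2).reshape(-1, 8)
```

All operations are plain tensor ops, so gradients flow from the 8 coordinates back to the 5 predicted values.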

Now I have to select only the pixels between these coordinates from the top feature map and then pass them to an nn.functional.interpolate() layer.

How do I do this while keeping the gradient intact?
I naively applied fillPoly directly on the feature map first; currently the code looks like:

    import cv2
    import numpy as np
    import torch

    # fm: feature map tensor, coords: (N, 8) quadrilateral corners
    fm_ = fm.clone().detach().numpy()
    coords_ = coords.clone().detach().numpy()

    # rasterize the quadrilaterals into a binary mask (h_, w_ = feature map size)
    masks = np.zeros((h_, w_), dtype=np.uint8)
    cv2.fillPoly(masks, coords_.reshape(-1, 4, 2).astype(np.int32), 1)
    masks = torch.tensor(masks, dtype=torch.float32)

    # zero out everything outside the quadrilaterals
    fm *= masks

The goal is to preserve only the pixel values of fm that lie inside a quadrilateral described by the points in coords_,
but the cv2 function only works on numpy arrays,
and converting to numpy breaks the gradient flow.

If I pass this fm to the interpolate layer, would that work correctly?

I updated the code with the suggestion, but this approach is still selecting extra pixels.

It looks like you are just creating the mask using numpy and applying it to your tensor fm.
If you don’t need gradients for the mask, this approach should generally work.
You might have to transform masks to a valid (non-differentiable) tensor before multiplying it with fm.
Have you tried your approach and noticed that it’s not working?

Yep, it’s a makeshift approach for now to just mask out all pixels that are not in my quadrilateral.
But the aim is to select only the pixels belonging to the quadrilateral ROIs and then pass those to F.interpolate().
Any way to do that? Kind of like what tf.image.crop_and_resize does, but for floating-point coordinates of a region defined by 8 coordinates, not the 4 corners of a horizontal box.
Basically a perspective transform on the feature map, but with a differentiable operation.

I think in that case grid_sample might be a useful function, as it samples the input at the pixel locations given by the specified grid.
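For example, a rough sketch of cropping one rotated quadrilateral from a feature map (assumptions: a single quad given as 4 corner points in pixel coordinates, ordered top-left, top-right, bottom-right, bottom-left; `roi_rotate` and the output resolution are made up for illustration):

```python
import torch
import torch.nn.functional as F

def roi_rotate(fm, quad, out_h, out_w):
    # fm: (1, C, H, W) feature map; quad: (4, 2) corners (x, y) in pixel coords,
    # ordered TL, TR, BR, BL
    _, C, H, W = fm.shape
    ys = torch.linspace(0, 1, out_h, device=fm.device)
    xs = torch.linspace(0, 1, out_w, device=fm.device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    tl, tr, br, bl = quad
    # bilinearly interpolate sample locations between the four corners
    top = tl + (tr - tl) * gx.unsqueeze(-1)
    bottom = bl + (br - bl) * gx.unsqueeze(-1)
    pts = top + (bottom - top) * gy.unsqueeze(-1)  # (out_h, out_w, 2) in pixels
    # normalize to [-1, 1] as grid_sample expects (align_corners=True convention)
    grid = torch.stack([pts[..., 0] / (W - 1) * 2 - 1,
                        pts[..., 1] / (H - 1) * 2 - 1], dim=-1)
    return F.grid_sample(fm, grid.unsqueeze(0), align_corners=True)
```

Since the grid is built from the corner tensors with differentiable ops, gradients flow back to the predicted box parameters as well as to the feature map.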

Hello @ptrblck,
I am facing the same problem: I have 5 points produced by a simple NN, and I want to fill the area inside them to generate a mask, so that I can apply a segmentation loss on the mask.
I implemented it using cv2.fillPoly after converting the 5 points to numpy, but unfortunately I lost the gradients.

I thought about your comment to use grid_sample, but unfortunately I don’t know how I could generate the mask with respect to the 5 points without losing the gradient tracing.

My main problem is that I don’t have all the desired positions to create my grid, and if I use fillPoly to get the positions, then I will lose the gradient trace.

Any suggestions?

Usually you would create the target mask before the training so that you wouldn’t need gradients for this operation.
Could you explain your use case a bit and how you would like to use these gradients for the mask?

First of all thank you for your fast reply.

Right, for the GT there is no need, but my network generates, for example, 4 points which represent a rotated box, each point represented by (x, y). I then need to convert the network output to a mask as well, to be able to apply a segmentation loss on both the predicted mask (after converting the predicted points to a mask) and the GT mask.

I tried to convert the predicted points to a mask using cv2.fillPoly, but the loss wasn’t changing during training. After investigation I realized this behaviour was because I lose the gradient trace when I convert the predictions to numpy to use fillPoly, so I need another way to achieve this that allows backprop.

I hope this time I was clear enough.

I’m unsure, but take a look at this approach and see if you could reuse it.
If I understand it correctly, you are dealing with mask targets, but your model outputs just coordinates, so you would want to create a mask from these coordinates?

You are completely right, but unfortunately the link won’t help me, as it is talking about boxes; I think it is easy to generate a mask from the coordinates if they form a box.

But in my case the coordinates can create an arbitrary shape, for example a star shape, so the approach proposed in that link won’t help me. I need something like cv2.fillPoly, but in PyTorch, to keep tracing gradients.

I see. You could check the source code of fillPoly and see if you could either port it directly to PyTorch, so that Autograd would create the backward pass automatically for you, implement the backward function manually (if possible at all), or alternatively let your model output the mask directly (which might be the easiest approach).
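As a starting point, for convex shapes a differentiable approximation of fillPoly is possible by multiplying sigmoids of signed distances to each edge (a rough sketch; the function name and the temperature `tau` are made up, and this won’t handle non-convex shapes like a star directly, which would need to be split into convex parts):

```python
import torch

def soft_convex_polygon_mask(verts, h, w, tau=1.0):
    # verts: (K, 2) convex polygon vertices (x, y) in image coords,
    # ordered so the interior lies to the left of each directed edge
    # (e.g. TL, TR, BR, BL for a box with y pointing down)
    ys = torch.arange(h, dtype=verts.dtype)
    xs = torch.arange(w, dtype=verts.dtype)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    pts = torch.stack([gx, gy], dim=-1)              # (h, w, 2) pixel coords
    mask = torch.ones(h, w, dtype=verts.dtype)
    K = verts.shape[0]
    for i in range(K):
        a, b = verts[i], verts[(i + 1) % K]
        e = b - a
        # signed distance of each pixel to edge a->b (positive = inside)
        d = (e[0] * (pts[..., 1] - a[1]) - e[1] * (pts[..., 0] - a[0])) / e.norm()
        # soft inside-test per edge; product over edges gives the soft mask
        mask = mask * torch.sigmoid(d / tau)
    return mask
```

Since the mask is built from the vertex tensor with differentiable ops, a segmentation loss on it backpropagates to the predicted points; `tau` trades off mask sharpness against gradient smoothness.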

Hello, I’m sorry to bother you.
May I ask you a question?
I have feature vectors obtained by passing the original image through a CNN, and I also have a segmentation result. I want to get the feature vectors of the yellow and green regions.
The segmentation result is shown.

How would these feature vectors be defined for these regions?
I.e. would you like to get a specific activation in a previous layer or would you like to process the output in a special way?

By feature vectors I mean feature maps.
I want to process the output in a special way.
In other words, I want the feature maps of the yellow and green regions.

To get the activation maps you could use forward hooks as described here.
However, the pixel locations of these maps might not correspond to the output locations and it depends on the architecture of your model.
E.g. convolution layers will use filters with a specific window, stride, dilation, etc., so that you would have to calculate the receptive field of the output locations for each activation map.
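As a small example, for a plain stack of conv/pooling layers the receptive field of one output location can be accumulated like this (a sketch assuming no dilation; `receptive_field` is a made-up helper taking (kernel_size, stride) pairs):

```python
def receptive_field(layers):
    # layers: list of (kernel_size, stride) tuples, in forward order
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the field by (k-1) input steps
        jump *= s             # stride compounds the step size between outputs
    return rf
```

E.g. two stacked 3x3 convs with stride 1 give a 5x5 receptive field, which is why a single activation map pixel mixes information from a neighborhood of input pixels rather than one location.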

If I bilinearly interpolate the feature maps to the same size as the original image, and then match the pixel positions one to one, would that work?

No, this still wouldn’t work, as each output pixel position might be calculated from a larger field of the input activation(s).
If all your convolutions use a 1x1 kernel, the output pixel locations would correspond to the input locations.
Captum uses specific methods for model interpretability, which e.g. use the gradient flow to visualize which activations were “important” for which output part, but that doesn’t seem to be your use case.

Thanks for your reply.
I understand what you’re saying.
But I still can’t figure out what method I should use to extract the feature maps of an ROI with an irregular shape.

But I remember ROI Align being used in Mask R-CNN, which can extract the feature maps of the candidate box.

ROIAlign interpolates the proposals, which are overlaid on the feature maps, no?

As I said, you could calculate the receptive field, if you know all conv setups.
Interpolating an activation map to the same shape as the output will not create a 1 to 1 mapping.
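That said, if an approximation is acceptable, a common practical workaround is masked average pooling: downsample the segmentation mask to the activation map’s spatial size and pool the features inside each region (a sketch; `region_feature` is a made-up helper, and as discussed this ignores the receptive field, so it is only an approximation):

```python
import torch
import torch.nn.functional as F

def region_feature(fm, mask):
    # fm:   (1, C, H, W) activation map
    # mask: (H_img, W_img) binary region mask from the segmentation result
    m = F.interpolate(mask[None, None].float(), size=fm.shape[-2:], mode="nearest")
    # average the features over the (downsampled) region -> one vector per region
    return (fm * m).sum(dim=(2, 3)) / m.sum().clamp(min=1)  # (1, C)
```

You would call this once with the yellow-region mask and once with the green-region mask to get one feature vector per region.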

Thanks for your reply.
I understand what you mean, and I know what I should do next.
Thanks again.