In Faster R-CNN or Mask R-CNN, the RoI Align layer takes differently sized RoIs as input and projects them onto a uniformly sized output.
I’m currently implementing this paper, where the authors feed rotated boxes into an inception layer for further detection. It is an end-to-end text recognition pipeline that first performs text box detection (the same task as object detection); the detected text boxes are then fed to an encoder-decoder to recognize the characters.
The box detection outputs (rotated bounding box geometry, score map, and final feature map) have the following shapes:

Feature map from text detector: torch.Size([1, 256, 128, 128])
ROI (geometry) from text detector: torch.Size([1, 5, 128, 128])
Score from text detector: torch.Size([1, 2, 128, 128])
The five parameters represent the distances from the current point to the top, bottom, left, and right sides of an associated bounding box, together with its inclined orientation. With this configuration, the detection branch can predict a quadrilateral of arbitrary orientation for each text instance.
The original text detector architecture is based on EAST. I have been able to match its baseline scores.
I’m currently stuck on the grid sampling part.
I first convert the 5 values ({4 distances, 1 angle}) to the corresponding 8 coordinates of a rotated bounding box as a quadrilateral (scaled to the feature map size and filtered by scores), producing a coordinates tensor of shape {numberOfPredictions x 8}:

torch.Size([2960, 8])
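For reference, a minimal sketch of how that distances-plus-angle to corners conversion might look. The function name and layout are my own, and I’m assuming EAST-style geometry (distances measured from the prediction point in the box’s local frame); the sign convention of the angle may differ from the paper’s:

```python
import torch

def distances_to_quad(points, geo):
    """Recover the 4 corners of a rotated box from EAST-style geometry.

    points: (N, 2) pixel coordinates (x, y) at which each box was predicted
    geo:    (N, 5) per-point (d_top, d_bottom, d_left, d_right, theta)
    returns (N, 8) corner coordinates (x0, y0, ..., x3, y3), clockwise
    """
    x, y = points[:, 0], points[:, 1]
    d_t, d_b, d_l, d_r, theta = geo.unbind(dim=1)
    cos, sin = torch.cos(theta), torch.sin(theta)

    # corners in the box's local (axis-aligned) frame, relative to (x, y)
    xs = torch.stack([-d_l, d_r, d_r, -d_l], dim=1)   # left/right offsets
    ys = torch.stack([-d_t, -d_t, d_b, d_b], dim=1)   # top/bottom offsets

    # rotate by theta, then translate back to the prediction point
    qx = x[:, None] + xs * cos[:, None] - ys * sin[:, None]
    qy = y[:, None] + xs * sin[:, None] + ys * cos[:, None]
    return torch.stack([qx, qy], dim=2).reshape(-1, 8)
```

Because this is pure tensor arithmetic, it stays differentiable w.r.t. the predicted geometry.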
Now I need to select only the pixels enclosed by these coordinates from the top feature map and pass them to torch.nn.functional.interpolate.
How do I do this while keeping the gradient intact?
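One way to keep the gradient intact is to skip the explicit pixel masking altogether and sample each rotated box with affine_grid/grid_sample, which is differentiable w.r.t. both the feature map and the box parameters (this is essentially a rotated-RoI-align). A rough sketch under those assumptions; the function and its parameters are illustrative, and the angle/axis sign conventions may need adjusting to match the detector’s:

```python
import torch
import torch.nn.functional as F

def crop_rotated_roi(fm, center, size, theta, out_h=8, out_w=32):
    """Differentiably crop one rotated box from a feature map.

    fm:     (1, C, H, W) feature map
    center: (cx, cy) box center in pixel coordinates (tensors)
    size:   (bw, bh) box width/height in pixels (tensors)
    theta:  rotation angle in radians (tensor)
    """
    _, C, H, W = fm.shape
    cx, cy = center
    bw, bh = size
    cos, sin = torch.cos(theta), torch.sin(theta)

    # affine matrix mapping the output grid into normalized [-1, 1] fm coords
    m = torch.stack([
        torch.stack([cos * bw / W, -sin * bh / W, 2 * cx / (W - 1) - 1]),
        torch.stack([sin * bw / H,  cos * bh / H, 2 * cy / (H - 1) - 1]),
    ]).unsqueeze(0)

    grid = F.affine_grid(m, (1, C, out_h, out_w), align_corners=True)
    return F.grid_sample(fm, grid, align_corners=True)
```

Since every rotated box is resampled to the same (out_h, out_w), this also replaces the separate interpolate step.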
I naively applied fillPoly on the feature map directly at first; my current code looks like this:
# only the coordinates need to go to NumPy; the mask itself carries no gradient
coords_ = coords.detach().cpu().numpy()
h_, w_ = fm.shape[-2:]
masks = np.zeros((h_, w_), dtype=np.uint8)
cv2.fillPoly(masks, coords_.reshape(-1, 4, 2).astype(np.int32), 1)
masks = torch.from_numpy(masks).to(fm.dtype)
fm = fm * masks  # broadcasts over the batch and channel dimensions
so that only the pixel values of fm that lie inside a quadrilateral described by the points in coords_ are preserved. But the cv2 function only works on NumPy arrays, and converting to NumPy breaks the gradient flow.
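Note that building the mask in NumPy only breaks the gradient w.r.t. the coordinates; since the mask enters as a constant factor, the gradient w.r.t. fm itself survives the multiplication. A quick check (the hard-coded mask below is just a stand-in for the cv2.fillPoly result):

```python
import torch

fm = torch.randn(1, 2, 4, 4, requires_grad=True)
mask = torch.zeros(4, 4)
mask[1:3, 1:3] = 1.0   # stand-in for the cv2.fillPoly output

out = fm * mask        # mask is a constant: no gradient reaches coords,
out.sum().backward()   # but the gradient w.r.t. fm is intact

# the gradient is exactly the mask, broadcast over batch and channels
assert torch.equal(fm.grad[0, 0], mask)
```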
If I pass this fm to the interpolate layer, will it work correctly?
Update: I changed the code as suggested, but this approach still selects extra pixels.