I have a question about the architecture of the RPN and RoIAlign in Faster/Mask R-CNN (see the image).
For each anchor in the RPN there are two ‘heads’: 4 bbox values and 1 confidence value, which are compared to the labels (offsets for the bbox and 1/0 for overlap). In the PyTorch implementation, though,
cls_logits has 3 channels, and
bbox_pred has 3x4=12.
I don’t quite understand what these 3 channels mean. As I understand it, there should be 1 and 4.
In the RoI pooling stage, there are again two heads,
cls_score with the number of classes (91), and
bbox_pred with 4*91=364.
I don’t quite understand why we need a bbox per class. We have a softmax prediction, so only 1 class can exist in this anchor and we need only 1 bbox, so where do the other 360 values come from? And what does the label look like? Offsets for the 4 correct values and 0s otherwise?