I have a question about the architecture of the RPN and RoIAlign in Faster/Mask R-CNN (see the image).

For each anchor, the RPN has two ‘heads’: 4 bbox values and 1 confidence value, which are compared to the labels (offsets for the bbox and 1/0 for overlap). In the PyTorch implementation, though, `cls_logits` has 3 channels and `bbox_pred` has 3x4=12. I don’t quite understand what these 3 channels mean. As I understand it, there should be 1 and 4.
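To make the numbers concrete, here is a minimal sketch of the channel arithmetic. It assumes (my guess, not something I have confirmed) that the factor of 3 is the number of anchors placed at each feature-map location, with one confidence value and 4 offsets per anchor. `rpn_head_channels` is a hypothetical helper for illustration, not a library function.

```python
def rpn_head_channels(num_anchors_per_location):
    """Return (cls_logits channels, bbox_pred channels) for an RPN head.

    Hypothetical helper. Assumption: every anchor at a location gets
    1 objectness score and 4 box offsets, so the channel counts scale
    with the number of anchors per location.
    """
    cls_channels = num_anchors_per_location * 1   # one score per anchor
    bbox_channels = num_anchors_per_location * 4  # four offsets per anchor
    return cls_channels, bbox_channels


print(rpn_head_channels(3))  # (3, 12): the shapes I see in the implementation
print(rpn_head_channels(1))  # (1, 4): the shapes I expected
```

If that assumption holds, my "1 and 4" would just be the single-anchor case.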

In RoIPooling, there are two heads: `cls_score` with the number of classes (91), and `bbox_pred` with 4*91=364. I don’t quite understand why we need a bbox per class. We have a softmax prediction, so only 1 class can exist in this anchor and we should need only 1 bbox. Where do the other 360 values come from? And what does the label look like? Offsets for the 4 correct values and 0s otherwise?
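Here is a minimal sketch of one way the 364 values could be read. It assumes the vector is laid out as 4 consecutive offsets per class and that only the 4 values belonging to one class (the ground-truth class during training, the predicted class at inference) are actually used. Both the layout and `select_class_box` are assumptions for illustration, not confirmed library behavior.

```python
def select_class_box(bbox_pred, class_index, num_classes):
    """Pick the 4 offsets for one class from a flat 4*num_classes vector.

    Hypothetical helper. Assumed layout:
    [class0: 4 offsets, class1: 4 offsets, ...].
    """
    if len(bbox_pred) != 4 * num_classes:
        raise ValueError("expected 4 values per class")
    start = 4 * class_index
    return bbox_pred[start:start + 4]


num_classes = 91
# Dummy prediction: value i at position i, so the selected slice is easy to check.
bbox_pred = [float(i) for i in range(4 * num_classes)]

gt_class = 17  # arbitrary example class index
print(select_class_box(bbox_pred, gt_class, num_classes))
# [68.0, 69.0, 70.0, 71.0]
```

Under this reading the label would be just 4 offsets plus a class index, and the other 360 outputs would simply be ignored for that RoI, rather than being compared against 0s.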

Thanks!