FCOS Implementation question

In torchvision's implementation of FCOS, anchors are used. For example, on line 396 an anchor_generator is created if none is passed when instantiating the class. However, the paper says the model is completely anchor-free. Could someone explain why anchors are used?

FCOS's per-pixel (i, j) predictions can be thought of as a single anchor per pixel. Reusing the existing AnchorGenerator makes it possible to build targets at different scales.
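To make the "one anchor per pixel" idea concrete, here is a minimal pure-Python sketch. It is not torchvision's AnchorGenerator; the function name `single_anchor_grid` and the choice to center each box at (j * stride, i * stride) with side length equal to the stride are assumptions made for illustration only.

```python
def single_anchor_grid(feat_h, feat_w, stride, size):
    """Return one (x1, y1, x2, y2) box per feature-map location (i, j).

    Hypothetical helper: each location gets exactly one square box,
    so the "anchor" carries only a center and a scale.
    """
    anchors = []
    half = size / 2
    for i in range(feat_h):          # rows of the feature map
        for j in range(feat_w):      # columns of the feature map
            cx, cy = j * stride, i * stride  # assumed center in image coordinates
            anchors.append((cx - half, cy - half, cx + half, cy + half))
    return anchors


# A 2x3 feature map at stride 8, with box size tied to the stride.
grid = single_anchor_grid(feat_h=2, feat_w=3, stride=8, size=8)
```

With one box per location there is nothing to match or tune across anchor shapes, which is why this still behaves like an anchor-free detector.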

Yes, it can be thought of that way, but the paper's main idea and goal was to avoid anchors and do everything anchor-free. The original implementation on the authors' GitHub page does not, in fact, use anchors. So it seems odd to me that torchvision's implementation uses them.

Just like FCOS, this implementation regresses a single set of 4 offsets per i,j location on the output feature maps along with classification and centerness.

The 4 offsets are predicted relative to the i,j location and, when supervised, are normalized by the anchor size.

The code is a bit tricky to follow because some elements of SSD are repurposed for FCOS. Anchors are used to hold stride information in their height and width, which is used to normalize the regressed box values. Ultimately, these anchors are converted to i,j center locations, and the offsets are applied to yield the predicted bboxes.

Please see compute_loss() in the FCOSHead and the decode() method in BoxLinearCoder.
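As a rough illustration of the steps described above, here is a pure-Python sketch of decoding and encoding. It is not the torchvision code; the names `decode_ltrb` and `encode_ltrb`, the (left, top, right, bottom) offset order, and the `normalize_by_size` flag are assumptions for this example.

```python
def decode_ltrb(offsets, anchor, normalize_by_size=True):
    """Turn (left, top, right, bottom) offsets into an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = anchor
    # The anchor is reduced to its center: the i,j location in image coordinates.
    cx, cy = 0.5 * (x1 + x2), 0.5 * (y1 + y2)
    l, t, r, b = offsets
    if normalize_by_size:
        # Anchor width/height carry the scale used to undo the normalization.
        w, h = x2 - x1, y2 - y1
        l, t, r, b = l * w, t * h, r * w, b * h
    return (cx - l, cy - t, cx + r, cy + b)


def encode_ltrb(box, anchor, normalize_by_size=True):
    """Inverse of decode_ltrb: build regression targets from a ground-truth box."""
    x1, y1, x2, y2 = anchor
    cx, cy = 0.5 * (x1 + x2), 0.5 * (y1 + y2)
    l, t, r, b = cx - box[0], cy - box[1], box[2] - cx, box[3] - cy
    if normalize_by_size:
        w, h = x2 - x1, y2 - y1
        l, t, r, b = l / w, t / h, r / w, b / h
    return (l, t, r, b)


anchor = (12.0, 4.0, 20.0, 12.0)   # center (16, 8), size 8
pred = decode_ltrb((1.0, 0.5, 2.0, 1.5), anchor)
```

Note that the anchor's corners never act as a prior box shape here. Only its center and its width and height are read, which matches the description of anchors as carriers of location and stride.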


Ah, I see. It makes more sense now. Thank you for the clarification.