torchvision's FCOS implementation uses anchors (for example, on line 396 a default anchor_generator is created if none is passed when instantiating the class). However, the paper says the model is completely anchor-free. Could someone explain why anchors are used?
Yes, it can be thought of that way, but the paper's main idea and goal was to avoid anchors and do everything anchor-free. The original implementation on the authors' GitHub page in fact doesn't use anchors, so it seems odd to me that torchvision's implementation does.
Just like FCOS, this implementation regresses a single set of four offsets per (i, j) location on the output feature maps, along with classification and centerness.
The four offsets are predicted relative to the (i, j) location and, when supervised, normalized by the anchor size, which holds the stride of that feature level.
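To make the regression target concrete, here is a minimal pure-Python sketch of that encoding. It is not the torchvision code: the helper name encode_ltrb is hypothetical, and mapping a feature cell (i, j) to the image point (j * stride, i * stride) is an assumption made for illustration.

```python
def encode_ltrb(gt_box, location, stride):
    """Distances from `location` to the left, top, right and bottom sides
    of `gt_box` (x1, y1, x2, y2), each divided by the level's stride."""
    x1, y1, x2, y2 = gt_box
    cx, cy = location
    return (
        (cx - x1) / stride,  # left
        (cy - y1) / stride,  # top
        (x2 - cx) / stride,  # right
        (y2 - cy) / stride,  # bottom
    )


# Feature cell (i=3, j=5) on a stride-8 level.
# Assumption: the cell maps to image coordinates (j * stride, i * stride).
stride = 8
i, j = 3, 5
location = (j * stride, i * stride)  # (40, 24)

# Ground-truth box in image coordinates (x1, y1, x2, y2).
targets = encode_ltrb((20.0, 12.0, 60.0, 44.0), location, stride)
print(targets)
```

Note that there is still only one set of four numbers per location, which is why the model remains anchor-free in the sense of the paper.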
The torchvision code is a bit tricky to follow because some elements of SSD are repurposed for FCOS. Anchors are used to hold the stride in their height and width, which is then used to normalize the regressed box values. Ultimately, these anchors are converted to (i, j) center locations and the offsets are applied to yield the predicted bboxes.
Please see compute_loss() in the FCOSHead and the decode() method in BoxLinearCoder.
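As a rough illustration of the decoding side, the sketch below builds one square "anchor" per location whose width and height equal the stride, then recovers a box from normalized offsets. It only mirrors the idea described above and is not the BoxLinearCoder source: make_anchor and decode_ltrb are hypothetical names, and centering the anchor at (j * stride, i * stride) is an assumption.

```python
def make_anchor(i, j, stride):
    """One square anchor per location; its side length carries the stride.
    Assumption: the anchor is centered at (j * stride, i * stride)."""
    cx, cy = j * stride, i * stride
    half = stride / 2
    return (cx - half, cy - half, cx + half, cy + half)


def decode_ltrb(rel_codes, anchor):
    """Turn normalized (l, t, r, b) offsets back into an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = anchor
    # The anchor is reduced to its center point ...
    cx, cy = 0.5 * (x1 + x2), 0.5 * (y1 + y2)
    # ... and its size is used only to undo the normalization.
    w, h = x2 - x1, y2 - y1
    l, t, r, b = rel_codes
    return (cx - l * w, cy - t * h, cx + r * w, cy + b * h)


anchor = make_anchor(i=3, j=5, stride=8)
box = decode_ltrb((2.5, 1.5, 2.5, 2.5), anchor)
print(anchor)
print(box)
```

The anchor's shape never acts as a box prior here; it only supplies a center point and a scale factor.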