In RetinaNet (e.g., in the Detectron2 implementation), the focal loss is normalized by the number of foreground elements, num_foreground. However, the elements that actually enter the loss are the valid elements valid_idxs, i.e., foreground and background elements. So I would expect the last code line to end with something like max(1, valid_idxs.sum()). Averaging over all elements is presumably also the behavior of torch.nn.functional.binary_cross_entropy_with_logits with reduction='mean'.
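To illustrate what I mean by that 'mean' behavior, here is a minimal sketch with made-up shapes:

import torch
import torch.nn.functional as F

# Made-up shapes: 5 valid anchors, 3 classes.
logits = torch.randn(5, 3)
targets = torch.zeros(5, 3)
targets[0, 1] = 1.0  # a single foreground assignment

loss_sum = F.binary_cross_entropy_with_logits(logits, targets, reduction="sum")
loss_mean = F.binary_cross_entropy_with_logits(logits, targets, reduction="mean")

# reduction='mean' divides the summed loss by the total number of elements
# (anchors * classes), i.e. it averages over all entries, not only foreground ones.
assert torch.allclose(loss_mean, loss_sum / logits.numel())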
The paper Focal Loss for Dense Object Detection by Lin et al. does not mention this normalization as far as I can see. The results are far better when normalizing by the number of foreground elements, as in the Detectron2 code shown below.
Can someone please explain why the number of foreground elements is chosen for normalization instead of the number of valid elements (foreground + background)?
valid_idxs = gt_classes >= 0  # foreground + background (anchors labeled -1 are ignored)
foreground_idxs = (gt_classes >= 0) & (gt_classes != self.num_classes)  # anchors matched to a ground-truth box
num_foreground = foreground_idxs.sum()
...
# logits loss
loss_cls = sigmoid_focal_loss_jit(
    pred_class_logits[valid_idxs],
    gt_classes_target[valid_idxs],
    alpha=self.focal_loss_alpha,
    gamma=self.focal_loss_gamma,
    reduction="sum",
) / max(1, num_foreground)
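For reference, here is a self-contained toy comparison of the two normalizations. It uses torchvision's sigmoid_focal_loss (the same loss that sigmoid_focal_loss_jit computes); the anchor and class counts are made up:

import torch
from torchvision.ops import sigmoid_focal_loss

# Made-up sizes: 1000 valid anchors, 80 classes, 5 of the anchors are foreground.
num_valid, num_classes, num_foreground = 1000, 80, 5
logits = torch.randn(num_valid, num_classes)
targets = torch.zeros(num_valid, num_classes)
targets[:num_foreground, 0] = 1.0  # one-hot targets for the foreground anchors

loss_sum = sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0, reduction="sum")

loss_detectron2 = loss_sum / max(1, num_foreground)  # normalization used in Detectron2
loss_expected = loss_sum / max(1, num_valid)         # normalization I would have expected
print(loss_detectron2.item(), loss_expected.item())  # differ by a factor of num_valid / num_foreground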