In RetinaNet (e.g., in the Detectron2 implementation), the focal loss is normalized by the number of foreground elements, num_foreground. However, the set of elements actually passed to the loss function is the set of valid elements, valid_idxs, i.e., both foreground and background elements. So I would expect the last code line to divide by max(1, valid_idxs.sum()) instead. This is most probably also the behavior when using
The paper Focal Loss for Dense Object Detection by Lin et al. does not mention this normalization as far as I can see. Yet the results are far better when normalizing by the number of foreground elements, as in the Detectron2 code shown below.
Can someone please explain why the number of foreground elements is chosen for normalization instead of the number of valid elements (foreground + background)?
```python
valid_idxs = gt_classes >= 0
foreground_idxs = (gt_classes >= 0) & (gt_classes != self.num_classes)
num_foreground = foreground_idxs.sum()
...
# logits loss
loss_cls = sigmoid_focal_loss_jit(
    pred_class_logits[valid_idxs],
    gt_classes_target[valid_idxs],
    alpha=self.focal_loss_alpha,
    gamma=self.focal_loss_gamma,
    reduction="sum",
) / max(1, num_foreground)
```
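To make the question concrete, here is a minimal, self-contained sketch of the two normalizations on a toy anchor set. The sigmoid_focal_loss function below is my own plain reimplementation of the formula from the paper, not Detectron2's sigmoid_focal_loss_jit, and the tensor shapes and the 1000-anchor / 10-foreground split are made-up illustrative assumptions; the point is only that, because background anchors vastly outnumber foreground ones, dividing the same summed loss by valid_idxs.sum() gives a much smaller value than dividing by num_foreground.

```python
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Focal loss from Lin et al.: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t),
    # summed over all elements (reduction="sum", as in the Detectron2 snippet).
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()

# Toy, assumed setup: 1000 anchors, 3 classes, mostly background
# (the typical dense-detector imbalance RetinaNet is designed around).
torch.manual_seed(0)
num_anchors, num_classes = 1000, 3
gt_classes = torch.full((num_anchors,), num_classes)           # all background...
gt_classes[:10] = torch.randint(0, num_classes, (10,))         # ...except 10 foreground

valid_idxs = gt_classes >= 0
foreground_idxs = (gt_classes >= 0) & (gt_classes != num_classes)
num_foreground = foreground_idxs.sum()

# One-hot targets: background anchors get an all-zero row.
targets = torch.zeros(num_anchors, num_classes)
targets[foreground_idxs, gt_classes[foreground_idxs]] = 1.0
logits = torch.randn(num_anchors, num_classes)

total = sigmoid_focal_loss(logits[valid_idxs], targets[valid_idxs])
loss_per_fg = total / max(1, num_foreground)       # Detectron2's normalization
loss_per_valid = total / max(1, valid_idxs.sum())  # the normalization asked about
print(float(loss_per_fg), float(loss_per_valid))
```

With 10 foreground anchors out of 1000 valid ones, the per-valid loss is two orders of magnitude smaller than the per-foreground loss, which also changes its relative weight against the box-regression loss.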