Focal loss normalization (RetinaNet)

In RetinaNet (e.g., in the Detectron2 implementation), the focal loss is normalized by the number of foreground elements, num_foreground. However, the elements that actually enter the loss function are the valid elements valid_idxs, i.e., both foreground and background elements. So I would expect the last line of the code below to divide by something like max(1, valid_idxs.sum()) instead. That is most probably also what torch.nn.functional.binary_cross_entropy_with_logits does with reduction='mean'.

The paper Focal Loss for Dense Object Detection by Lin et al. does not mention the normalization, as far as I can see. The results are far better when normalizing by the number of foreground elements, as in the Detectron2 code shown below.

Can someone please explain why the number of foreground elements is chosen for normalization instead of the number of valid elements (foreground + background)?

	valid_idxs = gt_classes >= 0
	foreground_idxs = (gt_classes >= 0) & (gt_classes != self.num_classes)
	num_foreground = foreground_idxs.sum()
	...
	# logits loss
	loss_cls = sigmoid_focal_loss_jit(
		pred_class_logits[valid_idxs],
		gt_classes_target[valid_idxs],
		alpha=self.focal_loss_alpha,
		gamma=self.focal_loss_gamma,
		reduction="sum",
	) / max(1, num_foreground)
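To make the difference between the two normalizers concrete, here is a minimal pure-Python sketch. It is not the Detectron2 code: the helper focal_loss, the single-class setup, and the toy numbers (10 uncertain foreground anchors, 100,000 easy background anchors) are assumptions made purely for illustration.

```python
import math

def focal_loss(logit, target, alpha=0.25, gamma=2.0):
    """Sigmoid focal loss for one binary logit (target is 0 or 1)."""
    p = 1.0 / (1.0 + math.exp(-logit))                # predicted probability
    p_t = p if target == 1 else 1.0 - p               # probability of the true label
    alpha_t = alpha if target == 1 else 1.0 - alpha   # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# Hypothetical toy image: few uncertain foreground anchors, many easy backgrounds.
num_foreground = 10
num_background = 100_000
num_valid = num_foreground + num_background

fg_sum = num_foreground * focal_loss(0.0, 1)    # foreground anchors at p = 0.5
bg_sum = num_background * focal_loss(-5.0, 0)   # confidently rejected backgrounds
loss_sum = fg_sum + bg_sum                      # corresponds to reduction="sum"

loss_per_foreground = loss_sum / max(1, num_foreground)
loss_per_valid = loss_sum / max(1, num_valid)

print("summed loss, foreground part:", fg_sum)
print("summed loss, background part:", bg_sum)
print("normalized by num_foreground:", loss_per_foreground)
print("normalized by num_valid:     ", loss_per_valid)
```

In this toy setup the summed loss is dominated by the foreground anchors even though they are outnumbered 10,000 to 1, and dividing by num_valid shrinks the loss by the factor num_valid / num_foreground. That factor changes from image to image with the number of anchors and objects, so it is not simply a constant rescaling of the learning rate.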

My guess is that focal loss yields high values only for the foreground boxes and the difficult background ones. It aims to balance the importance given to foreground and background boxes by largely ignoring the huge number of easily classified background boxes.
Hence, when background boxes are overwhelmingly numerous, normalizing by the number of foreground boxes (which is roughly close to foreground boxes + a smoothed number of difficult boxes) seems to be a better choice than normalizing by (foreground boxes + a humongous number of background boxes).
Maybe something slightly more intuitive would be normalizing by 2*(number of foreground boxes), to take into account that the focal loss processes not only the foreground boxes but also a roughly equal “equivalent smoothed quantity” of background boxes.
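One rough way to put a number on that “equivalent smoothed quantity” is to sum the focal modulating factor (1 - p_t)^gamma over the background boxes and treat the sum as an effective count. This is only my own illustrative assumption, not something taken from the paper or from Detectron2; the helper modulating_factor and the toy logits below are hypothetical.

```python
import math

def modulating_factor(logit, target, gamma=2.0):
    """Focal down-weighting term (1 - p_t) ** gamma for one binary logit."""
    p = 1.0 / (1.0 + math.exp(-logit))     # predicted probability
    p_t = p if target == 1 else 1.0 - p    # probability of the true label
    return (1.0 - p_t) ** gamma

# Hypothetical background boxes: mostly easy, a handful of hard ones.
num_easy, num_hard = 100_000, 20
easy_bg = num_easy * modulating_factor(-5.0, 0)   # confidently rejected
hard_bg = num_hard * modulating_factor(1.0, 0)    # wrongly scored as objects

raw_background = num_easy + num_hard
effective_background = easy_bg + hard_bg

print("raw background count:      ", raw_background)
print("effective background count:", effective_background)
print("  from easy boxes:", easy_bg)
print("  from hard boxes:", hard_bg)
```

With these made-up logits, the effective background count is on the order of the number of hard boxes rather than the raw 100,020, and most of it comes from the 20 hard boxes. That is the sense in which the number of foreground boxes (or a small multiple of it) is a more plausible normalizer than the raw number of valid boxes.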
Hope it makes sense, that is just my “intuitive math” interpretation.