How to calculate CIoU or DIoU loss only for certain unmasked boxes in a tensor and ignore the masked values?

# bbox loss (assumes `from torchvision import ops`)
# Skip the first position of every sequence
bbox_labels = batch['bbox'][:, 1:]
bbox_masks = batch['bbox_mask'][:, 1:]

# Zero out predictions and labels at the masked (non-td) positions
masked_bbox_preds = bbox_preds * bbox_masks
masked_bbox_labels = bbox_labels * bbox_masks

if self.config.bbox_loss == "smoothl1":
    box_loss = self.bbox_loss(masked_bbox_preds, masked_bbox_labels)
elif self.config.bbox_loss == "diou":
    masked_bbox_preds_xyxy = ops.box_convert(masked_bbox_preds, 'cxcywh', 'xyxy')
    masked_bbox_labels_xyxy = ops.box_convert(masked_bbox_labels, 'cxcywh', 'xyxy')
    box_loss = ops.distance_box_iou_loss(masked_bbox_preds_xyxy, masked_bbox_labels_xyxy, reduction='sum')
    # Each masked (all-zero) box adds 1.0 to the summed loss, so subtract their count
    bbox_masks_inv = (bbox_masks + 1) % 2
    box_loss -= bbox_masks_inv.sum()

# Average over the number of unmasked boxes
box_loss = box_loss / (bbox_masks.sum() + self.eps)

The model is a Transformer decoder with two branches: one predicts the HTML structure sequence for a table image, and the other predicts the bounding box for each cell token (‘<td>’, ‘<td></td>’, ‘<td’). For all other tokens, I have to mask the predictions and calculate the loss only for the td tokens.

With this setup, DIoU returns a loss of 1.0 for boxes that are [0.0,0.0,0.0,0.0]. So I subtract the number of masked boxes from the total loss to keep it consistent. However, the model predicts random boxes in places where it should predict properly. I’m not sure whether this approach is correct in terms of weight/gradient updates. Is there another way to use DIoU/CIoU with masks?

CIoU immediately gives nan for [0.0,0.0,0.0,0.0] because it also takes the aspect ratio into account, and I couldn’t figure out how to fix that.

For the input [0.0,0.0,1.0,1.0] (in xyxy format) as both box1 and box2, CIoU and DIoU both give 1.9e-7, which I assume comes from an epsilon somewhere. So I replaced the masked predictions with this box, but the loss still becomes nan after one step.
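The two DIoU values quoted above can be reproduced by hand. The following is a minimal pure-Python sketch of the standard DIoU formula, 1 - IoU + d²/c², for a single pair of boxes. The function name diou_loss_xyxy is hypothetical, and where the epsilon is added is an assumption modelled on common implementations, so the exact tiny value may differ from what a float32 library call returns.

```python
def diou_loss_xyxy(box1, box2, eps=1e-7):
    """DIoU loss 1 - IoU + d^2 / c^2 for two boxes in xyxy format (sketch)."""
    x1, y1, x2, y2 = box1
    x1g, y1g, x2g, y2g = box2

    # Intersection area (zero when the boxes do not overlap)
    iw = max(0.0, min(x2, x2g) - max(x1, x1g))
    ih = max(0.0, min(y2, y2g) - max(y1, y1g))
    inter = iw * ih
    union = (x2 - x1) * (y2 - y1) + (x2g - x1g) * (y2g - y1g) - inter
    iou = inter / (union + eps)  # assumed eps placement

    # Squared diagonal of the smallest box enclosing both boxes
    cw = max(x2, x2g) - min(x1, x1g)
    ch = max(y2, y2g) - min(y1, y1g)
    c2 = cw ** 2 + ch ** 2 + eps  # assumed eps placement

    # Squared distance between the two box centres
    d2 = ((x1 + x2) / 2 - (x1g + x2g) / 2) ** 2 + ((y1 + y2) / 2 - (y1g + y2g) / 2) ** 2

    return 1.0 - iou + d2 / c2


zero = [0.0, 0.0, 0.0, 0.0]
unit = [0.0, 0.0, 1.0, 1.0]

masked_loss = diou_loss_xyxy(zero, zero)  # IoU is 0 and d2 is 0, so the loss is 1.0
unit_loss = diou_loss_xyxy(unit, unit)    # IoU is 1 / (1 + eps), so the loss is roughly eps

# One real box plus two masked boxes: subtracting the masked count
# recovers the loss of the real box alone (up to rounding)
pred, label = [0.1, 0.1, 0.5, 0.5], [0.2, 0.2, 0.6, 0.6]
valid_loss = diou_loss_xyxy(pred, label)
corrected = (valid_loss + 2 * masked_loss) - 2
```

In this sketch an all-zero pair contributes a constant 1.0, which is why subtracting the number of masked boxes leaves the summed value unchanged for the real boxes.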

Any input would be really helpful. I have to experiment and complete the project soon. I need a good bounding box loss function (for the table cell bounding boxes) alongside cross entropy (for the HTML token prediction). I experimented with smooth L1 loss, but the model isn’t able to learn the boxes properly, whereas the token predictions are good.
The DIoU technique described above did improve the bbox predictions, but the issue I mentioned remains.
Another approach I tried was to keep reduction=‘none’, replace the ‘nan’ values with zero using torch.nan_to_num, and then take the mean. But the loss doesn’t converge as it should with this method, and I’m not sure what the issue is.
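One thing that might be worth checking with the nan_to_num approach is whether the gradients are still nan even though the forward value has been cleaned. The toy example below is only a stand-in for an aspect-ratio style term, not the actual CIoU code, and the variable names are made up for illustration. It shows that torch.nan_to_num can repair the forward value while the backward pass still produces nan.

```python
import torch

# Toy stand-in for an aspect-ratio term; NOT the actual CIoU implementation
w = torch.tensor(0.0, requires_grad=True)
h = torch.tensor(0.0, requires_grad=True)

v = torch.atan(w / h) ** 2       # 0 / 0 is nan, so v is nan
cleaned = torch.nan_to_num(v)    # the forward value is replaced with 0.0
cleaned.backward()

forward_is_finite = bool(torch.isfinite(cleaned))  # the cleaned loss value looks fine
grad_is_nan = bool(torch.isnan(w.grad))            # but the gradient w.r.t. w is nan
```

If something similar happens inside the real loss, the nan gradients could reach the model weights even when the reported loss value is finite. That is only a guess about this setup; inspecting the .grad tensors after backward() would confirm or rule it out.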