```python
# bbox loss
bbox_labels = batch['bbox'][:, 1:]
bbox_masks = batch['bbox_mask'][:, 1:]

# zero out predictions/labels at non-td positions
masked_bbox_preds = bbox_preds * bbox_masks
masked_bbox_labels = bbox_labels * bbox_masks

if self.config.bbox_loss == "smoothl1":
    box_loss = self.bbox_loss(masked_bbox_preds, masked_bbox_labels)
elif self.config.bbox_loss == "diou":
    # distance_box_iou_loss expects xyxy boxes, so convert from cxcywh
    masked_bbox_preds_xyxy = ops.box_convert(masked_bbox_preds, 'cxcywh', 'xyxy')
    masked_bbox_labels_xyxy = ops.box_convert(masked_bbox_labels, 'cxcywh', 'xyxy')
    box_loss = ops.distance_box_iou_loss(
        masked_bbox_preds_xyxy, masked_bbox_labels_xyxy, reduction='sum')
    # every masked (all-zero) box pair contributes a DIoU loss of 1.0,
    # so subtract the number of masked positions to cancel that out
    bbox_masks_inv = 1 - bbox_masks  # same as (bbox_masks + 1) % 2
    box_loss -= bbox_masks_inv.sum()

# average over the number of unmasked (td) positions
box_loss = box_loss / (bbox_masks.sum() + self.eps)
```

This is a Transformer decoder model in which one branch of the decoder predicts the HTML structure sequence for a table image and the other predicts the bbox for the corresponding cell tokens (`<td>`, `<td></td>`, `<td`). So, for all other tokens, I have to mask out the predictions and compute the loss only at the td positions.

The problem: DIoU gives a loss of 1.0 for box pairs that are both `[0.0, 0.0, 0.0, 0.0]`, so I subtract the number of masked boxes from the total loss to keep it consistent. However, the model ends up predicting random boxes at positions where it should predict properly, and I'm not sure this approach is correct in terms of weights/gradient updates. Is there another way to use DIoU/CIoU with masks?

CIoU straight away gives NaN for `[0.0, 0.0, 0.0, 0.0]` because it also considers the aspect ratio, which is 0/0 for a degenerate box, and I couldn't figure out how to fix that.

Edit:

So, for the input `[0.0, 0.0, 1.0, 1.0]` (xyxy format) in both box1 and box2, CIoU and DIoU both give 1.9e-7, which I'm assuming comes from an epsilon somewhere. So I converted the masked predictions to that box instead, but the loss still becomes NaN after one step.
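Roughly, the substitution looks like this (a sketch; I'm assuming the mask broadcasts as `(N, 1)` against `(N, 4)` xyxy boxes):

```python
import torch

def substitute_masked(boxes_xyxy: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """Replace masked-out boxes with the unit box [0, 0, 1, 1] in xyxy format."""
    unit_box = torch.tensor([0.0, 0.0, 1.0, 1.0], device=boxes_xyxy.device)
    # where the mask is 0, keep unit_box; where it is 1, keep the original box
    return torch.where(masks.bool(), boxes_xyxy, unit_box)

# example: the second box is masked and becomes the unit box
boxes = torch.tensor([[0.1, 0.1, 0.4, 0.5],
                      [0.0, 0.0, 0.0, 0.0]])
masks = torch.tensor([[1.0], [0.0]])
print(substitute_masked(boxes, masks))
```

Identical unit boxes have IoU ≈ 1 and a zero aspect-ratio difference, so DIoU/CIoU should be finite (~eps) at those positions, yet the NaN still shows up after one optimizer step.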