What is the output format of DETR?

I am going to fine-tune DETR on my own dataset and need to add some extra augmentation using albumentations. How should I format the target boxes in __getitem__: yolo, coco or pascal_voc?
The original dataset uses COCO format, i.e. [xmin, ymin, w, h], but I saw in the dataset code that it converts the boxes to normalized [xmin, ymin, xmax, ymax] like:

boxes[:, 2:] += boxes[:, :2]
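
To make sure I read that line correctly, here is a minimal plain-Python sketch of what I think it does to a single box. The helper name coco_xywh_to_xyxy is mine and is not from the DETR repo. As far as I can tell, this step only adds the top-left corner to the width and height; it does not rescale anything to [0, 1] by itself.

```python
def coco_xywh_to_xyxy(box):
    """Convert one box from [xmin, ymin, w, h] to [xmin, ymin, xmax, ymax].

    Hypothetical helper for illustration; mirrors the effect of
    boxes[:, 2:] += boxes[:, :2] on a single row.
    """
    xmin, ymin, w, h = box
    # xmax = xmin + w, ymax = ymin + h; the values stay in pixel units
    return [xmin, ymin, xmin + w, ymin + h]


print(coco_xywh_to_xyxy([10, 20, 30, 40]))  # [10, 20, 40, 60]
```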

But in post-processing, I see a function that converts [cx, cy, w, h] to [xmin, ymin, xmax, ymax]:

def box_cxcywh_to_xyxy():
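
I left the body out above. My understanding of the conversion it performs, written as a plain-Python sketch for a single box (the name cxcywh_to_xyxy and the list-based signature are my own assumptions, not the repo's tensor version), is:

```python
def cxcywh_to_xyxy(box):
    """Convert one box from [cx, cy, w, h] to [xmin, ymin, xmax, ymax].

    Hypothetical single-box version for illustration; the units
    (pixels or normalized) are whatever the input uses.
    """
    cx, cy, w, h = box
    # Move half the width/height away from the center in each direction
    return [cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h]


# A normalized box centered in the image
print(cxcywh_to_xyxy([0.5, 0.5, 0.25, 0.5]))  # [0.375, 0.25, 0.625, 0.75]
```

If this reading is right, the predictions would be in center format before post-processing, which is what makes me unsure about the format of the targets.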

So, in which format should I prepare the target bboxes so that they can be compared directly with the predicted bboxes?