Why is NMS so slow?

When the object confidence threshold is set to 0.01, NMS is extremely slow. But when it is set to 0.5, so that most candidate boxes are filtered out before NMS, it becomes very fast, but recall drops. The NMS code, referenced from GitHub, is as follows:

def non_max_suppression(prediction, conf_thres=0.5, nms_thres=0.4):
    """
    Removes detections with object confidence score lower than 'conf_thres' and performs
    Non-Maximum Suppression to further filter detections.
    Returns detections with shape:
        (x1, y1, x2, y2, object_conf, class_score, class_pred)
    """
    # From (center x, center y, width, height) to (x1, y1, x2, y2)
    prediction[..., :4] = xywh2xyxy(prediction[..., :4])
    output = [None for _ in range(len(prediction))]
    for image_i, image_pred in enumerate(prediction):
        # Filter out confidence scores below threshold
        image_pred = image_pred[image_pred[:, 4] >= conf_thres]
        # If none are remaining => process next image
        if not image_pred.size(0):
            continue
        # Object confidence times class confidence
        score = image_pred[:, 4] * image_pred[:, 5:].max(1)[0]
        # Sort by it
        image_pred = image_pred[(-score).argsort()]
        class_confs, class_preds = image_pred[:, 5:].max(1, keepdim=True)
        detections = torch.cat((image_pred[:, :5], class_confs.float(), class_preds.float()), 1)
        # Perform non-maximum suppression
        keep_boxes = []
        while detections.size(0):
            large_overlap = bbox_iou(detections[0, :4].unsqueeze(0), detections[:, :4]) > nms_thres
            label_match = detections[0, -1] == detections[:, -1]
            # Indices of boxes with lower confidence scores, large IOUs and matching labels
            invalid = large_overlap & label_match
            weights = detections[invalid, 4:5]
            # Merge overlapping bboxes by order of confidence
            detections[0, :4] = (weights * detections[invalid, :4]).sum(0) / weights.sum()
            keep_boxes += [detections[0]]
            detections = detections[~invalid]
        if keep_boxes:
            output[image_i] = torch.stack(keep_boxes)

    return output

I would guess the inner loop is invoked more often, so I would expect an increase in runtime. You could check it by, e.g., printing the number of invocations inside the loop and comparing both approaches.
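The check above can be sketched as follows. This is a standalone NumPy toy (random boxes rather than real model output, and hypothetical helper names `iou` and `count_nms_iterations`), so the exact numbers are illustrative only, but it shows the mechanism: a lower confidence threshold leaves far more candidates, and each pass of the greedy while-loop computes IoU against all remaining boxes, so the cost grows roughly quadratically with the candidate count.

```python
import numpy as np

def iou(box, boxes):
    """IoU of one box (x1, y1, x2, y2) against an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter + 1e-9)

def count_nms_iterations(conf, boxes, conf_thres, nms_thres=0.4):
    """Run the same greedy suppression loop; return (candidates, loop iterations)."""
    keep_mask = conf >= conf_thres
    dets = boxes[keep_mask][np.argsort(-conf[keep_mask])]
    iterations = 0
    while dets.shape[0]:
        iterations += 1
        overlap = iou(dets[0], dets) > nms_thres
        dets = dets[~overlap]  # the top box always suppresses itself (IoU = 1)
    return int(keep_mask.sum()), iterations

rng = np.random.default_rng(0)
n = 5000
xy = rng.uniform(0, 400, size=(n, 2))
wh = rng.uniform(10, 50, size=(n, 2))
boxes = np.concatenate([xy, xy + wh], axis=1)  # (x1, y1, x2, y2)
conf = rng.uniform(0, 1, size=n)

for thres in (0.01, 0.5):
    cand, iters = count_nms_iterations(conf, boxes, thres)
    print(f"conf_thres={thres}: {cand} candidates, {iters} loop iterations")
```

With conf_thres=0.01 nearly all 5000 boxes survive the pre-filter, so the loop runs many more times (and each iteration is over a larger array) than with conf_thres=0.5; instrumenting your real `non_max_suppression` the same way should show the same pattern.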