Replacing For Loop with indexing

I have some NumPy code that I want to convert to PyTorch. It implements NMS (non-maximum suppression) for 3D boxes, but I do not have enough expertise to write a CUDA kernel for it. Can someone please help me with the conversion? Specifically, I was hoping to find a way to remove the loop in the function below. I think that if the loop is replaced with indexing, the operation would be much faster.

import numpy as np


def nms(dets, scores, thresh):
    '''
    dets is a numpy array of shape (num_dets, 6): x1, y1, z1, x2, y2, z2
    scores is a numpy array of shape (num_dets,)
    '''
    x1 = dets[:, 0]
    y1 = dets[:, 1]
    z1 = dets[:, 2]
    x2 = dets[:, 3]
    y2 = dets[:, 4]
    z2 = dets[:, 5]

    volume = (x2 - x1 + 1) * (y2 - y1 + 1) * (z2 - z1 + 1)
    order = scores.argsort()[::-1]  # box indices sorted by score, highest first

    keep = []
    while order.size > 0:
        i = order[0]  # pick the highest-scoring remaining box
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        zz1 = np.maximum(z1[i], z1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        zz2 = np.minimum(z2[i], z2[order[1:]])

        w = np.maximum(0.0, xx2 - xx1 + 1)  # overlap along x (0 if disjoint)
        h = np.maximum(0.0, yy2 - yy1 + 1)  # overlap along y (0 if disjoint)
        l = np.maximum(0.0, zz2 - zz1 + 1)  # overlap along z (0 if disjoint)
        inter = w * h * l
        ovr = inter / (volume[i] + volume[order[1:]] - inter)

        inds = np.where(ovr <= thresh)[0]
        order = order[inds + 1]

    return keep
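
One possible direction, offered as a rough sketch rather than a definitive answer: the greedy suppression step is sequential by nature, so the loop is hard to remove entirely, but the IoU computation can be moved out of it by building the full pairwise IoU matrix once with broadcasting. The function names `pairwise_iou_3d` and `nms_3d` are made up for this illustration, and the sketch assumes `dets` and `scores` are tensors on the same device and that an N x N matrix fits in memory.

```python
import torch


def pairwise_iou_3d(boxes):
    """IoU between every pair of boxes; boxes is (N, 6): x1, y1, z1, x2, y2, z2."""
    lo, hi = boxes[:, :3], boxes[:, 3:]
    # Same "+ 1" inclusive-coordinate convention as the NumPy version.
    volume = (hi - lo + 1).prod(dim=1)                           # (N,)
    inter_lo = torch.maximum(lo[:, None, :], lo[None, :, :])     # (N, N, 3)
    inter_hi = torch.minimum(hi[:, None, :], hi[None, :, :])     # (N, N, 3)
    # Clamp each axis to zero before multiplying so disjoint boxes give 0.
    inter = (inter_hi - inter_lo + 1).clamp(min=0).prod(dim=2)   # (N, N)
    return inter / (volume[:, None] + volume[None, :] - inter)


def nms_3d(dets, scores, thresh):
    """Greedy NMS; returns indices into dets of the kept boxes, best score first."""
    order = scores.argsort(descending=True)
    iou = pairwise_iou_3d(dets[order])  # rows and columns are in score order
    suppressed = torch.zeros(len(order), dtype=torch.bool, device=dets.device)
    keep = []
    for i in range(len(order)):
        if suppressed[i]:
            continue  # an already-suppressed box must not suppress others
        keep.append(i)
        # One row-wise mask update replaces the per-iteration IoU computation.
        suppressed |= iou[i] > thresh
    keep = torch.tensor(keep, dtype=torch.long, device=order.device)
    return order[keep]
```

The loop still runs once per box, but each iteration is now a single comparison against a precomputed row. The trade-off is O(N^2) memory for the IoU matrix, so this is likely only practical for a moderate number of detections.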

The same question is also being discussed here. Please refer to that thread for clarifications.