I feel the mask slice below takes too much time.

import time

import torch

pred = torch.rand((30000, 10))
pred = pred.cuda(6)
ind = pred[:, 5] > 0.9

st = time.time()
pred = pred[ind]
time_passed = time.time() - st

The time_passed is about 10 ms. Is that normal, or did I do something wrong?

CUDA operations are asynchronous, so you should add torch.cuda.synchronize() before starting and stopping the timer.

Also, average over a few iterations to get a more stable estimate, and run some warm-up iterations before the actual profiling.
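Putting that advice together, here is a minimal timing sketch (an assumed standalone setup, not the original script; it falls back to CPU if no GPU is available so it runs anywhere):

```python
import time

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
pred = torch.rand((30000, 10), device=device)
ind = pred[:, 5] > 0.9

def sync():
    # synchronize() blocks until all queued CUDA kernels have finished;
    # without it, time.time() only measures the (fast) kernel launch.
    if device == "cuda":
        torch.cuda.synchronize()

# Warm-up iterations exclude one-time CUDA initialization costs
for _ in range(10):
    _ = pred[ind]

sync()  # make sure no earlier work is still running
st = time.time()
n_iters = 100
for _ in range(n_iters):
    out = pred[ind]
sync()  # wait for the timed kernels to actually finish
elapsed_ms = (time.time() - st) / n_iters * 1000
print(f"mean slice time: {elapsed_ms:.3f} ms")
```

Without the two `sync()` calls, the timer would stop while the indexing kernels are still queued on the GPU, and the cost would silently show up in whichever later operation forces a synchronization.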

1.  def non_max_suppression(self, prediction, conf_thres=0.5, nms_thres=0.3):
2.      """
3.      Removes detections with lower object confidence score than 'conf_thres'
4.      Non-Maximum Suppression to further filter detections.
5.      Returns detections with shape:
6.          (x1, y1, x2, y2, object_conf, class_conf, class)
7.      """
9.      min_wh = 2  # (pixels) minimum box width and height
10.     output = [None] * len(prediction)
11.     for image_i, pred in enumerate(prediction):
12.         pred_time = time.time()
13.         pred = pred[(pred[:, 4] > conf_thres) & (pred[:, 2] > min_wh) & (pred[:, 3] > min_wh)]
14.         end_pred_time = time.time()
15.         logger.info('pred_time in optimize {}'.format(end_pred_time - pred_time))

        # If none are remaining => process next image
        if len(pred) == 0:
            continue

        # Select predicted classes
        class_conf, class_pred = pred[:, 5:].max(1)

        # Box (center x, center y, width, height) to (x1, y1, x2, y2)
        pred[:, :4] = self.xywh2xyxy(pred[:, :4])
        pred[:, 4] *= class_conf  # improves mAP from 0.549 to 0.551

        # Detections ordered as (x1y1x2y2, obj_conf, class_conf, class_pred)
        pred = torch.cat((pred[:, :5], class_conf.unsqueeze(1), class_pred.unsqueeze(1).float()), 1)

        # Get detections sorted by decreasing confidence scores
        pred = pred[(-pred[:, 4]).argsort()]

        det_max = []
        nms_style = 'MERGE'  # 'OR' (default), 'AND', 'MERGE' (experimental)
        merge_time = time.time()

        for c in pred[:, -1].unique():
            dc = pred[pred[:, -1] == c]  # select class c
            dc = dc[:min(len(dc), 100)]  # limit to first 100 boxes
            nms_id = torchvision.ops.nms(dc[:, :4], dc[:, 4], 0.3)
            det_max.append(dc[nms_id])  # keep the surviving boxes for class c

        if len(det_max):
            det_max = torch.cat(det_max)  # concatenate
            output[image_i] = det_max[(-det_max[:, 4]).argsort()]  # sort
        logger.info('merge time {}'.format(time.time() - merge_time))

    return output

This is the non_max_suppression code for my YOLOv3 post-processing. The shape of the argument prediction is batch_size × 27783 × (5 + number of classes).

So the for loop that begins at line 11 repeats batch_size times.

When my batch_size is 1, the time spent on lines 12 to 15 ranges from 2 ms to 10 ms. When my batch_size is 5, it is about 90 ms, even after I have run inference for more than 100 batches. The shape of pred after the slice is less than 100 × (5 + number of classes).

Is the time spent on the tensor slice (line 13) normal? How can I optimize it? I thought this step would take less than 1 ms.

Did you measure these timings on the CPU or the GPU?
In the latter case you would have to synchronize your code as explained in the previous post.

Thank you very much for your quick and patient reply.

I measure these timings on GPU which is GeForce GTX 1080Ti.

If I want to measure timings for a few parts of my code, do I need to put torch.cuda.synchronize() at every place I do the measurement, or do I just need a single torch.cuda.synchronize() at the outermost part?

You would have to add the synchronization before starting and stopping the timer.
Also, make sure to run a few warmup iterations before the actual profiling and average the timing for a few iterations to get a more stable result.
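To avoid repeating the synchronization boilerplate around every timed section, one option is a small helper. This is a hypothetical sketch, not part of the original code, and it falls back gracefully on CPU-only machines:

```python
import time
from contextlib import contextmanager

import torch

@contextmanager
def gpu_timer(label):
    # Synchronize before starting and after stopping the timer, so each
    # wrapped section is measured in isolation from earlier queued work.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    yield
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    print(f"{label}: {(time.time() - start) * 1000:.3f} ms")

# Example: wrap each section you want to measure
x = torch.rand(30000, 10)
with gpu_timer("mask slice"):
    y = x[x[:, 5] > 0.9]
```

Each `with gpu_timer(...)` block then answers the question directly: the synchronization has to happen at every start/stop point, and the context manager makes that automatic.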

Thank you very much.

I added the synchronization at every place where a timer starts or stops.

The slice time is now about 0.5 ms, and it turns out that the ~10 ms previously attributed to the slice actually belongs to the forward pass.