I am trying to speed up the SSD family of object detectors in PyTorch. Most implementations use a CUDA-based non-maximum suppression (NMS) for efficiency, but that implementation (from Fast/er R-CNN) works on one image at a time. With a batch size larger than 1, post-processing and NMS become the bottleneck, because the images in the batch are processed serially rather than in parallel, something like this:
    locs, scores = network.forward(images)  # batch of images as input
    bboxes, scores = postprocess(locs, scores)  # convert bboxes to pixel coordinates

    # the next part becomes the bottleneck for batched processing
    for i in range(images.size(0)):  # for each image in the batch
        img_boxes = bboxes[i]
        img_scores = scores[i]
        for j in range(1, num_classes):
            # filter bboxes by score threshold
            keep = non_max_supression(img_boxes, img_scores, th=0.50)
            # ...
As the batch size increases from 1 to 4, 8,…, the post-processing & NMS time per image becomes more than the
So it is desirable to have batched post-processing and non-maximum suppression that run in parallel on the GPU. I could not think of a way to parallelize this in PyTorch (each image requires separate processing), but it should be possible in CUDA.
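One idea that might avoid writing a new kernel is the coordinate-offset trick: flatten all boxes from the batch into one list, shift each box by an offset that depends on its (image, class) group so that boxes from different groups can never overlap, and then run a single ordinary NMS over everything. Below is a minimal pure-Python sketch of that idea, not a GPU implementation. The names `iou`, `nms`, `batched_nms_by_offset` and `group_ids` are hypothetical, and the sketch assumes non-negative pixel coordinates in `(x1, y1, x2, y2)` form.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold):
    """Plain greedy NMS; returns kept indices, highest score first."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    for k in order:
        # keep box k only if it does not overlap too much with a kept box
        if all(iou(boxes[k], boxes[m]) <= iou_threshold for m in keep):
            keep.append(k)
    return keep


def batched_nms_by_offset(boxes, scores, group_ids, iou_threshold):
    """Run one NMS pass over a flattened batch.

    group_ids[k] is a hypothetical integer id for the (image, class) pair
    that box k belongs to. Assumes all coordinates are >= 0.
    """
    if not boxes:
        return []
    max_coord = max(max(b) for b in boxes)
    shifted = []
    for b, g in zip(boxes, group_ids):
        # each group is moved to its own disjoint region of the plane,
        # so boxes from different groups always have zero IoU
        off = g * (max_coord + 1.0)
        shifted.append((b[0] + off, b[1] + off, b[2] + off, b[3] + off))
    return nms(shifted, scores, iou_threshold)


# two images with identical detections: each image keeps its own best box
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (0, 0, 10, 10), (1, 1, 11, 11)]
scores = [0.9, 0.8, 0.7, 0.6]
keep = batched_nms_by_offset(boxes, scores, [0, 0, 1, 1], iou_threshold=0.5)
```

With tensors, the shift would be a single broadcasted add and the suppression a single call to whatever GPU NMS kernel is available, so the Python loop over images and classes disappears. As far as I know, recent torchvision releases expose `torchvision.ops.batched_nms(boxes, scores, idxs, iou_threshold)`, which takes a per-box group index in the same spirit. Whether this is actually faster than the per-image loop would need to be measured.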
I searched for such a batch implementation, but could not find any.
Do you know of any batch implementation of NMS?
Or any suggestions?