Batch non-maximum suppression on the GPU

I am trying to speed up the SSD family of object detectors in PyTorch. Most implementations use a CUDA-based non-maximum suppression (NMS) for efficiency, but that implementation (from Fast/er R-CNN) works on one image at a time. With a batch size larger than 1, post-processing and NMS become the bottleneck, because each image in the batch is processed serially rather than in parallel, something like this:

locs, scores = network.forward(images)        # batch of images as input
bboxes, scores = postprocess(locs, scores)    # convert bboxes to pixel coordinates

# the next part becomes the bottleneck for batched processing
for i in range(images.size(0)):    # for each image in the batch
    img_boxes = bboxes[i]
    img_scores = scores[i]

    for j in range(1, num_classes):    # for each class, skipping class 0
        # assuming img_scores has shape [num_boxes, num_classes]
        cls_scores = img_scores[:, j]
        # filter bboxes by score threshold, then suppress overlaps for this class
        keep = non_max_suppression(img_boxes, cls_scores, th=0.50)
        # ...

As the batch size increases from 1 to 4, 8, …, the post-processing and NMS time per image becomes larger than the time taken by network.forward(images).

So it is desirable to have batched post-processing and non-maximum suppression that run in parallel on the GPU. I could not think of a way to parallelize this in PyTorch (each image requires separate processing), but it should be possible in CUDA.

I searched for such a batch implementation, but could not find any.
Do you know of any batch implementation of NMS?
Or any suggestions?

batch_multiclass_non_max_suppression, defined in post_processing.py of the TensorFlow Object Detection API, might be helpful when developing your NMS code.

Thanks for the cryptic TensorFlow code link :smile:
I will keep this in mind, but it would probably be easier to design and code it from scratch than to try to understand the cryptic TensorFlow code. I used TensorFlow for over a year, and I dislike it a lot.

for i in range(images.size(0)): might be parallelizable by using torch.multiprocessing.

Yes, but will the processes be scheduled to run in parallel on a single GPU?
That is, can multiple torch processes (torch.multiprocessing is said to be a drop-in replacement for Python's multiprocessing) run in parallel on the same GPU? I doubt it.

torch.cuda.Stream might make it possible to execute NMS tasks on a single GPU device asynchronously.

A document about CUDA Stream

This comment is helpful.

Hello there,

Using the idea behind torchvision's batched_nms, the following code decodes several images and several classes at once. It works because batched_nms offsets the boxes according to their category, so boxes from different categories never suppress each other.

I also tried to accelerate box encoding, if you are interested you can have a peek here: https://github.com/etienne87/torch_object_rnn/blob/master/core/anchors.py

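import torch
from torchvision.ops import batched_nms

# assumed shapes: box_preds [batch, num_anchors, 4],
# cls_preds [batch, num_anchors, label_offset + num_classes]
num_classes = cls_preds.shape[-1] - self.label_offset
num_anchors = box_preds.shape[1]
# repeat every box once per class, then flatten boxes and scores
boxes = box_preds.unsqueeze(2).expand(-1, num_anchors, num_classes, 4).contiguous()
scores = cls_preds[..., self.label_offset:].contiguous()
boxes = boxes.view(-1, 4)
scores = scores.view(-1)
# one group id per (image, class) pair: image * num_classes + class
rows = torch.arange(len(box_preds), dtype=torch.long)[:, None]
cols = torch.arange(num_classes, dtype=torch.long)[None, :]
idxs = rows * num_classes + cols
idxs = idxs.unsqueeze(1).expand(len(box_preds), num_anchors, num_classes)
# move to the scores' device but keep integer ids (to(scores) would cast them to float);
# reshape instead of view because the expanded tensor is not contiguous
idxs = idxs.to(scores.device).reshape(-1)
# drop low-scoring candidates before NMS
mask = scores >= score_thresh
boxesf = boxes[mask].contiguous()
scoresf = scores[mask].contiguous()
idxsf = idxs[mask].contiguous()

# a single NMS call over all images and classes
keep = batched_nms(boxesf, scoresf, idxsf, nms_thresh)

boxes = boxesf[keep]
scores = scoresf[keep]
# recover the class and the image each kept box belongs to
labels = idxsf[keep] % num_classes
batch_index = idxsf[keep] // num_classes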
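
To make the index arithmetic in the last two lines concrete, here is a small plain-Python sketch (no PyTorch needed). The batch size and class count are made up for illustration. Each (image, class) pair gets a unique id, image * num_classes + class, and the modulo and integer division recover the two parts again.

```python
# Hypothetical sizes, chosen only for illustration.
batch_size, num_classes = 2, 3

# Same arithmetic as `idxs = rows * num_classes + cols` in the snippet above.
group_ids = [[img * num_classes + cls for cls in range(num_classes)]
             for img in range(batch_size)]

# Decoding mirrors `idxs // num_classes` (image) and `idxs % num_classes` (class).
decoded = [(g // num_classes, g % num_classes)
           for row in group_ids for g in row]

print(group_ids)  # -> [[0, 1, 2], [3, 4, 5]]
print(decoded)    # -> [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```

Because every id is unique per image and class, passing these ids as the idxs argument means a box can only be suppressed by another box from the same image and the same class.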

Does batched_nms speed up batch inference?

Yes. If you were previously decoding sequentially (with a Python for loop, image by image), batched_nms should give you a very good speed-up. For me it went from 200+ ms to 14 ms using the code above, which flattens all boxes and calls torchvision.ops.batched_nms.

OK, I will try it soon. Can I use it with PyTorch 1.1?

:grimacing: Not sure when batched_nms was introduced. I would update everything if I were you.

By the way, all of this is significantly slower if you have a huge number of overlapping boxes (a score threshold of about 0.1 or less), but it should still be faster than per-image decoding.

Thanks, I’m trying to change nms in maskrcnn-benchmark to batched_nms.

Hi All,

What is expected for the ‘idxs’ argument?

It doesn’t seem necessary for the normal nms.
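
For what it's worth, my understanding from the posts above is that idxs holds one group id per box (a category, or an image/class combination), and that boxes with different ids never suppress each other. Plain nms has no notion of groups, which is why it has no such argument. Below is a minimal plain-Python sketch of the offset idea. It is not torchvision's actual code: the helper names (iou, nms, batched_nms_sketch) are made up for illustration, and it assumes non-negative box coordinates.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh):
    """Greedy NMS: indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # keep a box only if it does not overlap too much with any kept box
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


def batched_nms_sketch(boxes, scores, idxs, iou_thresh):
    """Shift each group's boxes far away from the other groups, then run one NMS."""
    if not boxes:
        return []
    shift = max(max(b) for b in boxes) + 1.0  # larger than any coordinate
    shifted = [tuple(c + g * shift for c in b) for b, g in zip(boxes, idxs)]
    return nms(shifted, scores, iou_thresh)


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.7]
idxs = [0, 0, 1]  # the first two boxes share a group, the third is in another

print(nms(boxes, scores, 0.5))                       # -> [0]
print(batched_nms_sketch(boxes, scores, idxs, 0.5))  # -> [0, 2]
```

Plain NMS drops the third box because it coincides with the first one, while the group-aware version keeps it because it belongs to a different group.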