I am trying to speed up the SSD family of object detectors in PyTorch. Most implementations use a CUDA-based non-maximum suppression (NMS) for efficiency, but the implementation (from Fast/er R-CNN) is image-based. If the batch size is larger than 1, the post-processing & NMS become the bottleneck, as each image in the batch is processed serially, not in parallel, something like this:
locs, scores = network.forward(images)      # batch of images as input
bboxes, scores = postprocess(locs, scores)  # convert bboxes to pixel coordinates
# the next part becomes the bottleneck for batched processing
for i in range(images.size(0)):             # for each image in the batch
    img_boxes = bboxes[i]
    img_scores = scores[i]
    for j in range(1, num_classes):         # skip background class 0
        # filter bboxes by score threshold, then run per-class NMS
        keep = non_max_suppression(img_boxes, img_scores[:, j], th=0.50)
        # ...
As the batch size increases from 1 to 4, 8, …, the post-processing & NMS time per image becomes larger than the time spent in network.forward(images).
So, it is desirable to have batched post-processing and non-maximum suppression that run in parallel on the GPU. I could not think of a way to parallelize it in PyTorch (each image requires separate processing), but it should be possible in CUDA.
I searched for such a batch implementation, but could not find any.
Do you know of any batch implementation of NMS?
Or any suggestions?
batch_multiclass_non_max_suppression defined in post_processing.py of Tensorflow Object Detection API might be helpful for the development of your NMS code.
Thanks for the cryptic TensorFlow code link
I will keep this in mind, but it would probably be easier to design and code it from scratch than to try to understand the cryptic TensorFlow code. I used TensorFlow for over a year, and I dislike it a lot.
Yes, but will the processes be scheduled to run in parallel on a single GPU?
That is, can multiple torch processes (which are said to be a drop-in replacement for Python processes) run in parallel on the same GPU? I doubt it.
hello there,
using the awesome idea from torchvision's batched_nms, the following code can decode several images / several classes at once. It works because batched_nms offsets boxes according to their category, so you never perform a wrong suppression.
Yes, if before you were decoding sequentially (with a Python for loop, image by image), then with batched_nms you should get a very good acceleration (for me it went from 200+ ms to 14 ms using the code above that flattens all boxes and calls torchvision's batched_nms).
btw, all of it is significantly slower if you have a huge number of overlapping boxes (score_threshold ~= 0.1 or less), but it should still be faster than per-image decoding.
Hello Jochem, sorry for the late answer. Basically, batched_nms works by offsetting boxes according to their index, so you don't suppress boxes that overlap while not being of the same category or image.
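The offset trick can be sketched in a few lines (my own illustrative helper, `offset_boxes`, not the actual torchvision internals, though torchvision's implementation follows the same idea): shift each box by a per-id offset larger than any coordinate, so boxes with different ids can never overlap, and one plain NMS pass over the shifted boxes is equivalent to per-id NMS.

```python
# Hedged sketch of the coordinate-offset trick behind batched_nms.
import torch

def offset_boxes(boxes, idxs):
    # boxes: [N, 4] (x1, y1, x2, y2); idxs: [N] integer (image/class) ids.
    if boxes.numel() == 0:
        return boxes
    max_coord = boxes.max()
    # Each id gets its own disjoint region of the coordinate plane.
    offsets = idxs.to(boxes.dtype) * (max_coord + 1)
    return boxes + offsets[:, None]
```

After this shift, a single class-agnostic NMS over `offset_boxes(boxes, idxs)` can only ever compare boxes sharing the same id, which is exactly why no "wrong suppression" across images or categories can happen.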
Using batched NMS is actually ~30% slower for me. While the NMS itself is faster, it has to measure overlap between a larger number of boxes, which cancels the advantage of doing it in a batched fashion.