Faster R-CNN inference time scales linearly with batch size because of the RPN?

I have a system that runs inference with torchvision's fasterrcnn_resnet50_fpn on single images, and recently I've been testing what happens when I batch examples together. However, the time taken by model(images) scales almost exactly linearly (±5%) with the number of images in the batch – it takes the same amount of time per image, and I'm not seeing any speedup from sending multiple images to the GPU at once.
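The timing I'm doing amounts to something like this (a minimal sketch with random 800×800 inputs and made-up iteration counts; the real system feeds actual images), with torch.cuda.synchronize() so I'm not just measuring kernel launches:

```python
import time
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

device = torch.device("cuda")
model = fasterrcnn_resnet50_fpn().to(device).eval()

def time_batch(batch_size, size=(3, 800, 800), iters=10):
    # Faster R-CNN expects a list of [C, H, W] tensors, one per image
    images = [torch.rand(size, device=device) for _ in range(batch_size)]
    with torch.no_grad():
        model(images)                      # warm-up
        torch.cuda.synchronize()           # drain queued kernels before timing
        start = time.perf_counter()
        for _ in range(iters):
            model(images)
        torch.cuda.synchronize()           # make sure all GPU work finished
        elapsed = time.perf_counter() - start
    return elapsed / (iters * batch_size)  # seconds per image

for bs in (1, 2, 4, 8):
    print(f"batch={bs}: {time_batch(bs) * 1000:.1f} ms/image")
```

The ms/image number stays essentially flat as the batch size grows, which is the behaviour I'm asking about.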

Looking through the code, I've noticed that fasterrcnn_resnet50_fpn requires a list of per-image tensors, rather than the batched [N, C, H, W] tensor I'd expect. Why is that?
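To make that concrete, this is the signature difference I mean (a sketch with dummy inputs):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn().eval()

# What I expected to pass: one batched tensor of shape [N, C, H, W]
batch = torch.rand(4, 3, 800, 800)

# What the model actually takes: a list of [C, H, W] tensors
# (which don't even have to share the same H and W)
images = list(batch)  # 4 tensors of shape [3, 800, 800]

with torch.no_grad():
    outputs = model(images)  # one dict of boxes/labels/scores per image

print(len(outputs), sorted(outputs[0].keys()))
```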

There’s a lot of code called from fasterrcnn_resnet50_fpn’s forward pass, but so far I’ve found code that doesn’t seem to parallelize across the batch in the following (the snippet after the list shows what the transform actually produces):
models/detection/anchor_utils.py::AnchorGenerator.forward()
models/detection/transform.py::GeneralizedRCNNTransform.forward()
models/detection/rpn.py::RPNHead.forward()
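From what I can tell by poking at the transform module directly, it resizes and normalizes each image in a Python loop and only then pads everything into one batched tensor, so the backbone at least appears to see a real batch (a sketch; the shapes are arbitrary):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn().eval()
images = [torch.rand(3, 480, 640), torch.rand(3, 800, 800)]

with torch.no_grad():
    # GeneralizedRCNNTransform: per-image resize/normalize, then pad into a batch
    image_list, _ = model.transform(images, None)

print(type(image_list).__name__)  # ImageList
print(image_list.tensors.shape)   # one padded [N, C, H, W] tensor
print(image_list.image_sizes)     # per-image sizes after resizing
```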

I’m not yet good enough at profiling PyTorch code to figure out where the real bottleneck is, though. I’ve done enough profiling to verify that 95%+ of my code’s time is spent inside fasterrcnn_resnet50_fpn’s forward(), so I know the problem isn’t my data loaders or any of the surrounding code.
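In case it's useful, something like torch.profiler gives me an op-level table, but I haven't managed to map that back to backbone vs. RPN vs. ROI heads (a sketch with dummy inputs):

```python
import torch
from torch.profiler import profile, ProfilerActivity
from torchvision.models.detection import fasterrcnn_resnet50_fpn

device = torch.device("cuda")
model = fasterrcnn_resnet50_fpn().to(device).eval()
images = [torch.rand(3, 800, 800, device=device) for _ in range(8)]

with torch.no_grad():
    model(images)  # warm-up so one-time setup costs stay out of the trace
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        model(images)

# Sort by GPU time to see which ops dominate the batched call
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```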

Are the RPN algorithms that PyTorch uses fundamentally unable to parallelize across a batch? Is this an implementation issue? Or have I misunderstood something?