Efficient spatio-temporal ROI extraction from videos

Hello,
I’m doing multi-person spatio-temporal action recognition. For each person, the action recognition model receives a sequence of ROIs (detected per frame with an object detector) and outputs the action probabilities.
I’m trying to make the whole pipeline run in real time (30 fps), and I noticed that the speed drops significantly as the number of people in the video increases.
The action recognition model receives the input for all people as a single batch, but the preprocessing of its input data is done sequentially, one person at a time. In other words, I can crop all the frames of the video in one operation, but only for one person.
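
To make the bottleneck concrete, here is a minimal sketch of the kind of preprocessing I mean. It is not my actual code: it assumes the frames are a NumPy array of shape (T, H, W, C) and, for simplicity, one fixed integer box (x1, y1, x2, y2) per person for the whole clip. The function and variable names are made up for illustration.

```python
import numpy as np

def crop_per_person(frames, boxes):
    """Crop every frame of a clip once per person.

    frames: (T, H, W, C) array holding the clip.
    boxes:  (P, 4) integer array of (x1, y1, x2, y2), one box per person
            (assumed fixed over the clip, for illustration only).
    Returns a list of P arrays, each of shape (T, y2 - y1, x2 - x1, C).
    """
    crops = []
    for x1, y1, x2, y2 in boxes:  # the per-person loop I would like to remove
        # One slice crops all T frames at once, but only for this person.
        crops.append(frames[:, y1:y2, x1:x2])
    return crops

# Dummy clip: 16 frames of 240x320 RGB, with two people.
frames = np.zeros((16, 240, 320, 3), dtype=np.uint8)
boxes = np.array([[10, 20, 110, 220],
                  [150, 30, 250, 230]])
crops = crop_per_person(frames, boxes)
```

The slice inside the loop is already vectorized over time; the cost grows with the number of people because the loop runs once per box.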
Is there a way to crop the video at multiple locations at the same time, so that I can remove the per-person loop?
Thank you.