Post-processing the network's output without starving the GPU

I’m executing a network that I’ve already trained.
It runs on a large set of images: for each one I want to send it through the network, do some post-processing, and save the result to a file.

My problem is that the post-processing takes some time (~1–2 s per image), so the GPU sits idle waiting for the next image and overall throughput drops.

What is the correct way to delegate the post-processing? Maybe something similar to the way DataLoader delegates the pre-processing?

The way I would do it is to push the output batches into a multiprocessing.Queue, keeping the network forward pass in its own process.

Then, have a few post-processing workers listening to this queue. Since the post-processing is CPU-bound, worker processes avoid the GIL contention that threads would run into.
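Here is a minimal sketch of that idea, assuming you already have a trained model and a DataLoader over your images (the `model = ...` and `loader = ...` placeholders, as well as `post_process_worker`, `NUM_WORKERS`, and `QUEUE_MAXSIZE`, are illustrative names, not part of any library API). The main process runs the forward pass and puts CPU tensors on a queue; a pool of worker processes consumes them and does the slow post-processing and saving.

```python
import torch
import torch.multiprocessing as mp  # drop-in for multiprocessing with tensor sharing

NUM_WORKERS = 4      # hypothetical: tune to your CPU core count
QUEUE_MAXSIZE = 8    # bounds memory use if post-processing falls behind

def post_process_worker(queue):
    """Consume (output, paths) items until a None sentinel arrives."""
    while True:
        item = queue.get()
        if item is None:      # sentinel: no more batches coming
            break
        output, paths = item
        # ... your ~1-2 s post-processing and file saving goes here ...

def main():
    model = ...   # your trained network, already moved to the GPU and in eval mode
    loader = ...  # your DataLoader over the image set

    queue = mp.Queue(maxsize=QUEUE_MAXSIZE)
    workers = [mp.Process(target=post_process_worker, args=(queue,))
               for _ in range(NUM_WORKERS)]
    for w in workers:
        w.start()

    with torch.no_grad():
        for images, paths in loader:
            output = model(images.cuda(non_blocking=True))
            # move outputs to CPU so the workers never need to touch CUDA
            queue.put((output.cpu(), paths))

    for _ in workers:   # one sentinel per worker so each one exits its loop
        queue.put(None)
    for w in workers:
        w.join()

if __name__ == "__main__":
    # spawn avoids re-initializing CUDA in forked children
    mp.set_start_method("spawn", force=True)
    main()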
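```

The bounded queue is the important design choice: if the workers fall behind, `queue.put` blocks the forward loop instead of letting GPU outputs pile up in memory, while a large enough `maxsize` still keeps the GPU busy between batches.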