Optimizing PyTorch Inference Pipeline with TensorRT: Parallel Processing and Memory Management Questions

Hi,

I’m trying to optimize a PyTorch inference pipeline for a segmentation model that runs with TensorRT. The general setup: multiple producers read images from Redis, preprocess them, and feed them to the TensorRT model; multiple consumers then postprocess the model’s output.

To maximize parallelism in the pre- and post-processing stages, I created multiple workers using the torch.multiprocessing module. My question now is how to optimize the data transfers from preprocessing to the TensorRT model, and from the model to postprocessing.

Here is the general setup with multiple pre- and post-processing workers, but only one inference worker:

import torch
import redis
import tensorrt as trt

input_queue = torch.multiprocessing.Queue()
output_queue = torch.multiprocessing.Queue()

def preprocessing_worker(input_queue):
    redis_client = redis.Redis()  # Each worker opens its own Redis connection (simplified)
    stream = torch.cuda.Stream()
    while True:
        with torch.cuda.stream(stream):
            image = redis_client.get(...)  # Fetch the next image from Redis (key handling omitted)
            preprocessed_image = preprocess(image).pin_memory()  # Preprocess on CPU and pin the host memory
            cuda_preprocessed_image = preprocessed_image.cuda(non_blocking=True)  # Async HtoD copy on this stream
            stream.synchronize()  # Wait for the copy to finish before sharing the tensor
            input_queue.put(cuda_preprocessed_image)

def inference_worker(input_queue, output_queue):
    stream = torch.cuda.Stream()
    trtContext = ...  # Deserialize the engine and create the TensorRT execution context (omitted)
    while True:
        with torch.cuda.stream(stream):
            input_image = input_queue.get()
            output = ...  # CUDA tensor allocated as the engine's output binding (shape omitted)
            # execute_async_v2 takes the binding device pointers and the raw stream handle
            trtContext.execute_async_v2(
                bindings=[input_image.data_ptr(), output.data_ptr()],
                stream_handle=stream.cuda_stream,
            )
            stream.synchronize()
            output_queue.put(output)

def postprocessing_worker(output_queue):
    stream = torch.cuda.Stream()
    while True:
        with torch.cuda.stream(stream):
            output = output_queue.get()
            output = output.cpu()  # Copy to host
            postprocess_output(output)  # Postprocess output
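
For context, the workers are launched roughly like this (a simplified sketch: the queues should really be created after the start method is set, the worker count of 4 is illustrative, and shutdown/error handling is omitted):

import torch.multiprocessing as mp

if __name__ == "__main__":
    mp.set_start_method("spawn")  # Required when sharing CUDA tensors between processes

    procs = [mp.Process(target=inference_worker, args=(input_queue, output_queue))]
    for _ in range(4):  # Illustrative number of pre-/post-processing workers
        procs.append(mp.Process(target=preprocessing_worker, args=(input_queue,)))
        procs.append(mp.Process(target=postprocessing_worker, args=(output_queue,)))

    for p in procs:
        p.start()
    for p in procs:
        p.join()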

My questions are:

  1. To overlap HtoD (host-to-device) copies, DtoH (device-to-host) copies, and kernel execution, I created a separate CUDA stream in each worker. Is this necessary?
  2. If separate streams are needed, is it mandatory to synchronize the stream after the .cuda(non_blocking=True) call before putting the tensor into the queue?
  3. Does this approach leak memory if I never manually delete the tensors taken from the queues? (See the snippet below for what I mean by “delete”.)
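
By “manually delete” I mean explicitly dropping the tensor references once a worker is done with them, roughly like this (a hypothetical variant of the postprocessing loop above):

output = output_queue.get()
postprocess_output(output.cpu())  # Copy to host and postprocess
del output  # Explicitly drop the reference to the shared CUDA tensor
torch.cuda.empty_cache()  # Optionally return cached blocks to the driver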

Thank you in advance for your help!
