Multiprocessing during CNN inferencing

I’ve been scratching my head for some time now trying to make CNN inference faster. I have trained and frozen a CNN whose last layer has been removed, and I need to generate embeddings from new images with it. The images have to go through the same image transformations that were used for training. I process all the image frames in a video, and I have about 8000 videos to process. Without multiprocessing it takes about 10 s per video (~130 frames), which is very long, so I want to use multiprocessing to try to make it faster.

I’ve set requires_grad to False for all parameters in the CNN.
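For reference, a minimal sketch of what freezing for inference can look like. The helper name freeze_for_inference and the toy model are hypothetical stand-ins, not the actual network. Whether the original script also calls eval() is not shown, so that part is an assumption; it matters for layers such as dropout and batch norm. Wrapping the forward pass in torch.no_grad() mainly makes the intent explicit and guards against any parameter that was left unfrozen.

```python
import torch
from torch import nn


def freeze_for_inference(model):
    """Disable gradients on every parameter and switch the model to eval mode."""
    for param in model.parameters():
        param.requires_grad_(False)
    return model.eval()


# Stand-in model for illustration only; the real CNN comes from torch.load(filepath).
toy_cnn = freeze_for_inference(
    nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
    )
)

# Forward pass without autograd bookkeeping; 4 fake frames of size 32x32.
with torch.no_grad():
    embeddings = toy_cnn(torch.randn(4, 3, 32, 32))
```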

Code without multiprocessing:

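import os

import torch
from PIL import Image
from torchvision import transforms

transformations = transforms.Compose([
    transforms.Resize((image_res, image_res)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
model = torch.load(filepath)  # requires_grad is set to False on all parameters after this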

def create_embedding(video_folder, num_frames, transformations, cnn):
    assert os.path.exists(video_folder)
    input_images = []

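    for index in range(num_frames):
        image_path = os.path.join(video_folder, "images{:04d}.png".format(index))
        # load the frame and apply the training-time transformations (PIL loading assumed)
        input_images.append(transformations(Image.open(image_path)))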

    # stack input_images to create a tensor of size num_frames * channels * height * width
    input_images = torch.stack(input_images, dim=0)

    # model forward pass
    return cnn(input_images)

Loading all frames in the video and passing them through the CNN at once is faster than passing each frame through the CNN separately and stacking the outputs at the end.

Then I tried to use multiprocessing to parallelize the image loading and transformation task, since that’s taking the longest time.

import multiprocessing as mp

def f(input_images, video_folder, transformations, index):
    image_path = os.path.join(video_folder, "images{:04d}.png".format(index))  
    # logger.debug(image_path)
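    # load the frame and write the transformed tensor into its slot (PIL loading assumed)
    input_images[index] = transformations(Image.open(image_path))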

def create_embedding_parallel(video_folder, num_frames, transformations, model):

    with mp.Manager() as manager:

        input_images = manager.list(range(num_frames))  # define size to enforce image ordering
        processes = []
        for index in range(num_frames):
            p = mp.Process(target=f, args=(input_images, video_folder, transformations, index))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()

        input_images = torch.stack(list(input_images), dim=0)

        # model forward pass
        return model(input_images)

This runs into race conditions and various other errors.
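One way to sidestep the shared list altogether is to have each worker return its result and let the executor keep the ordering. The sketch below is only an illustration: load_frames_ordered and load_fn are hypothetical names, and load_fn stands for "open the image and apply the transformations". It uses threads by default, which only helps if the loading work releases the GIL (an assumption here, not something measured). ProcessPoolExecutor has the same map interface, but then load_fn must be a picklable top-level function and the call should sit under an if __name__ == "__main__": guard.

```python
from concurrent.futures import ThreadPoolExecutor


def load_frames_ordered(frame_paths, load_fn, max_workers=4,
                        executor_cls=ThreadPoolExecutor):
    """Apply load_fn to every path in parallel and return results in input order."""
    with executor_cls(max_workers=max_workers) as executor:
        # Executor.map yields results in the order of frame_paths,
        # regardless of which worker finishes first.
        return list(executor.map(load_fn, frame_paths))
```

The returned list can then go straight into torch.stack(..., dim=0), with no pre-sized placeholder list and one small pool of workers per video instead of one process per frame.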

I also tried running the transformation and the CNN forward pass for each input image in a separate process and stacking the outputs at the end, but that was slower than the version without multiprocessing. I tried pool.starmap(), reducing the number of processes, etc., but it is still slower.
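For comparison, the usual PyTorch route for parallel loading is torch.utils.data.DataLoader with num_workers > 0, which keeps the worker processes alive across batches and returns batches in order when shuffle=False. The following is a rough sketch under assumptions, not a benchmarked solution: FrameDataset, embed_frames and load_fn are hypothetical names, and the batch_size and num_workers defaults are arbitrary.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class FrameDataset(Dataset):
    """Hypothetical dataset with one item per frame path."""

    def __init__(self, frame_paths, load_fn):
        self.frame_paths = frame_paths
        # load_fn: path -> transformed tensor, e.g. open the image and apply transformations
        self.load_fn = load_fn

    def __len__(self):
        return len(self.frame_paths)

    def __getitem__(self, index):
        return self.load_fn(self.frame_paths[index])


def embed_frames(frame_paths, load_fn, cnn, batch_size=64, num_workers=4):
    """Load frames in worker processes and run the CNN batch by batch."""
    loader = DataLoader(
        FrameDataset(frame_paths, load_fn),
        batch_size=batch_size,
        shuffle=False,  # keep frame order
        num_workers=num_workers,
    )
    cnn.eval()
    outputs = []
    with torch.no_grad():
        for batch in loader:
            outputs.append(cnn(batch))
    # concatenate per-batch outputs back into num_frames x embedding_dim
    return torch.cat(outputs, dim=0)
```

Passing the frame paths of several videos at once lets the same workers be reused across videos. With num_workers > 0, load_fn generally needs to be a picklable top-level function; running with num_workers=0 first is a simple way to check correctness before measuring speed.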

I’d appreciate any pointers to make this faster - with or without multiprocessing.