I’ve been scratching my head for some time now trying to make CNN inference faster. I have trained and frozen a CNN whose last layer has been removed. I need to generate embeddings from new images using the CNN. The images have to go through the same image transformations that were used for training. I process all the image frames in a video, and I have about 8000 videos to process. Without multiprocessing it takes about 10 s per video (~130 frames), which is very long, so I want to use multiprocessing to try to make it faster.
I’ve set requires_grad to False for all parameters in the CNN.
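For reference, a minimal sketch of what freezing the parameters can look like. The small `nn.Sequential` model and the dummy batch are hypothetical stand-ins for the real CNN and frames, and the call to `model.eval()` is an assumption that goes beyond the step described above.

```python
import torch
from torch import nn

# Hypothetical stand-in for the frozen CNN loaded with torch.load(filepath).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

# Freeze every parameter so autograd does not track gradients for them.
for param in model.parameters():
    param.requires_grad_(False)

# Assumption: inference only, so put layers such as dropout and
# batch norm (if the real model has any) into evaluation mode.
model.eval()

# Dummy batch of 4 RGB frames (num_frames * channels * height * width).
frames = torch.randn(4, 3, 32, 32)
embeddings = model(frames)
```

With all parameters frozen and an input that does not require gradients, the output `embeddings` does not require gradients either.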
Code without multiprocessing:
import os

import torch
from PIL import Image
from torchvision import transforms

transformations = transforms.Compose(
    [
        transforms.Resize((image_res, image_res)),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
        ),
    ]
)
model = torch.load(filepath)  # set requires_grad to False in all parameters after this
def create_embedding(video_folder, num_frames, transformations, cnn):
    assert os.path.exists(video_folder)
    input_images = []
    for index in range(num_frames):
        image_path = os.path.join(video_folder, "images{:04d}.png".format(index))
        input_images.append(transformations(Image.open(image_path)))
    # stack input_images to create a tensor of size num_frames * channels * height * width
    input_images = torch.stack(input_images, dim=0)
    # model forward pass
    return cnn(input_images)
Loading all the frames of a video and passing them through the CNN in one batch is faster than passing each frame through the CNN separately and stacking the results at the end.
Then I tried to use multiprocessing to parallelize the image loading and transformation task, since that’s taking the longest time.
import multiprocessing as mp

def f(input_images, video_folder, transformations, index):
    image_path = os.path.join(video_folder, "images{:04d}.png".format(index))
    # logger.debug(image_path)
    input_images[index] = transformations(Image.open(image_path))

def create_embedding_parallel(video_folder, num_frames, transformations, model):
    with mp.Manager() as manager:
        input_images = manager.list(range(num_frames))  # define size to enforce image ordering
        processes = []
        for index in range(num_frames):
            p = mp.Process(target=f, args=(input_images, video_folder, transformations, index))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()
        input_images = torch.stack([_ for _ in input_images], dim=0)
    # model forward pass
    return model(input_images)
This runs into a race condition and various other errors.
I also tried running the transformation and the CNN forward pass for each input image in a separate process and stacking the results at the end, but that was slower than the version without multiprocessing. I also tried pool.starmap(), reducing the number of processes, etc., but it is still slower.
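For anyone trying to reproduce this, here is a rough sketch of what a pool.starmap() variant of the loading step might look like. It is not necessarily the exact code I ran: the names `load_frame` and `load_frames_parallel` and the `num_workers` parameter are made up for illustration, and it assumes `transformations` can be pickled so it can be sent to the workers.

```python
import multiprocessing as mp
import os

import torch
from PIL import Image


def load_frame(video_folder, transformations, index):
    """Load one frame and apply the transformations (runs in a worker)."""
    image_path = os.path.join(video_folder, "images{:04d}.png".format(index))
    with Image.open(image_path) as image:
        return transformations(image)


def load_frames_parallel(video_folder, num_frames, transformations, num_workers=4):
    """Load all frames of one video with a fixed-size pool of workers."""
    tasks = [(video_folder, transformations, index) for index in range(num_frames)]
    with mp.Pool(processes=num_workers) as pool:
        # starmap returns results in the same order as `tasks`,
        # so the frame order is kept without a shared list.
        frames = pool.starmap(load_frame, tasks)
    # tensor of size num_frames * channels * height * width
    return torch.stack(frames, dim=0)
```

On platforms that start workers with spawn, `load_frames_parallel` has to be called from under an `if __name__ == "__main__":` guard. Note that this sketch creates a new pool for every video and sends every transformed tensor back to the parent process, and both of those costs may be part of why it ends up slower.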
I’d appreciate any pointers to make this faster - with or without multiprocessing.
Thanks.