Apply pre-processing on multiple images at the same time

Hello. I wrote a function to load and pre-process images, shown below. The function loads images one at a time and applies the pre-processing to each one individually. However, it seems very slow…

If possible, I'd like to apply the pre-processing to multiple images at the same time, not one by one.
Does anybody know how?

```python
import os

import torch
from PIL import Image
from torchvision import transforms

def read_2D(self, file_path):
    image_names = os.listdir(file_path)  # get file names
    trans = transforms.Compose([
        transforms.Resize(size=(224, 224)),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ])

    images = []
    for name in image_names:
        img = Image.open(os.path.join(file_path, name)).convert('RGB')
        images.append(trans(img))

    # Stack once at the end instead of torch.cat in every iteration.
    # (This also fixes a bug in the original: starting from a zeros
    # placeholder and slicing images[:-1] dropped the last real image,
    # not the placeholder.)
    return image_names, torch.stack(images)
```

If you want a speedup, have a look at Python's process pools. They work well when we want to preprocess independent data such as images. In some cases, though, they may not give better results, because process pools require exchanging data among the Python processes; if the data cannot be passed between processes efficiently, you may end up spending more time.

I have written some code for your problem. It needs a little bit of tweaking, but it works fine.

Thank you so much!
I’ll check the execution time.

I have a similar problem, only that I need post-processing on the network output before I can pass it on. I was looking for some sort of DataLoader-type approach, where I can write the logic for a single image and it automatically handles a batch of them. The code by @kelam_goutam looks nice, but I was wondering if PyTorch already has something built in.

My problem is that my network outputs values in the range [-1, 1]. So I need to shift and scale, multiply by 255, and convert to uint8 before passing it on to imageio's imwrite.
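That conversion step itself is small; a sketch (assuming a single CHW output tensor, and that imwrite wants an HWC array):

```python
import numpy as np
import torch

def to_uint8_image(out):
    # out: one CHW tensor with values in [-1, 1]
    out = (out.clamp(-1.0, 1.0) + 1.0) * 127.5  # shift/scale to [0, 255]
    out = out.to(torch.uint8)                   # truncate to uint8
    return out.permute(1, 2, 0).cpu().numpy()   # CHW -> HWC for imwrite
```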

@soshishimada Would be interested to see your benchmarks on the code!

@morpheusPrime I am not sure if PyTorch has such a function; I didn't check. However, since the output of your model will be a batch of, say, 4 images, I guess you have to define a post-processing method for yourself, as I did in my code, and use the process pool to call it each time you predict. The difference is that during preprocessing we had the entire dataset in one place, so we called the process pool only once and it helped; here we will be getting the output in batches.
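A rough sketch of that per-batch pattern (the helper names here are hypothetical, not from anyone's actual code):

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def _post_one(arr):
    # One HxWxC float array in [-1, 1] -> uint8 image.
    return ((np.clip(arr, -1.0, 1.0) + 1.0) * 127.5).astype(np.uint8)

# Create the pool once and reuse it across batches, so each prediction
# only pays the cost of shipping data to the workers, not of starting
# new processes.
pool = ProcessPoolExecutor(max_workers=4)

def postprocess_batch(batch):
    # batch: list of per-image arrays taken from the network output
    return list(pool.map(_post_one, batch))
```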

I see. Thanks for the tip!

But would it be useful here? I am not familiar with the process pool library, but if I have to call it on every batch output, won't it add overhead, since the processes would need to be started up each time? Is there a better way of doing it that adds the minimum amount of overhead?

Frankly speaking, I don't know. But I feel calling it repeatedly would add overhead: a single batch would not have enough images to process simultaneously, so the parallelization setup cost would dominate. I feel it is only worth doing if the batch size is 64 or above, but what batch size you can set depends on your available resources.
However, for my own post-processing I never used parallel execution.

That makes sense. I guess since it is just evaluations, I can wait. Thanks for your input! 🙂