Using torchvision.transforms.ToTensor within a ThreadPoolExecutor yields strange results

Hello everyone,

So I stumbled upon a strange behavior while trying to implement a batch-level concurrent loader. The problem arises when I use a transform function inside a function that is executed by a process/thread pool.

I think the snippet should be self-explanatory and should allow you to reproduce the issue. I must say it is quite mysterious to me :)

Of course there are easy ways to work around the problem, but I thought it is a peculiar case and it might be interesting to know the root cause.

from torchvision.transforms import ToTensor
import PIL.Image
import cv2
import numpy as np
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor


transform = ToTensor()

# create a dummy image and a "batch" of 128 paths pointing to it
PIL.Image.fromarray(np.random.randn(1024, 1024, 3).astype('uint8')).save('im.jpg')
batch_of_images = ['im.jpg'] * 128


def load_image(path):
    # plain OpenCV load + BGR -> RGB conversion
    image = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
    return image

def load_image_w_transfom(path):
    # same load, but followed by ToTensor
    image = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
    return transform(image)

for func in [load_image, load_image_w_transfom]:
    # threaded loading via a 4-worker ThreadPoolExecutor
    t = time.time()
    with ThreadPoolExecutor(4) as tpe:
        res = list(tpe.map(func, batch_of_images))
    td = time.time()
    print("execution time w threads", str(func), td - t)

    # sequential loading for comparison
    t = time.time()
    for path in batch_of_images:
        func(path)
    td = time.time()
    print("execution time no threads", str(func), td - t)
        
execution time w threads <function load_image at 0x7fb51c02cef0> 0.6988275051116943
execution time no threads <function load_image at 0x7fb51c02cef0> 2.2862112522125244
execution time w threads <function load_image_w_transfom at 0x7fb51c02c3b0> 11.793959856033325
execution time no threads <function load_image_w_transfom at 0x7fb51c02c3b0> 5.095749616622925

torch v 1.3.0
torchvision v 0.4.1a0+d94043a

I'm not sure, but I think you might be creating too many threads, which will yield a performance hit.
torchvision.transforms.ToTensor should already use multiple threads under the hood, and you should see this in the core utilization using e.g. htop.
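
To get a feel for the potential oversubscription, you could compare the worst-case thread count (pool workers × intra-op threads) with the cores available to the process. A rough check, assuming Linux for os.sched_getaffinity:

import os
import torch

pool_workers = 4                      # worker threads in the ThreadPoolExecutor
intra_op = torch.get_num_threads()    # OpenMP/intra-op threads each tensor op may use
cores = len(os.sched_getaffinity(0))  # CPU cores available to this process

print(f"worst case ~{pool_workers * intra_op} busy threads on {cores} cores")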

If I limit the OpenMP threads to 1, I get a speedup, although it’s still slower than just using PyTorch internal functions:

$ python script.py
execution time w threads <function load_image at 0x7f07d032c5f0> 0.026784420013427734
execution time no threads <function load_image at 0x7f07d032c5f0> 0.016533851623535156
execution time w threads <function load_image_w_transfom at 0x7f077147e950> 5.865751266479492
execution time no threads <function load_image_w_transfom at 0x7f077147e950> 3.3059628009796143

$ OMP_NUM_THREADS=1 python script.py
execution time w threads <function load_image at 0x7f84762715f0> 0.031061887741088867
execution time no threads <function load_image at 0x7f84762715f0> 0.013380765914916992
execution time w threads <function load_image_w_transfom at 0x7f84173c3950> 1.241473913192749
execution time no threads <function load_image_w_transfom at 0x7f84173c3950> 3.595093011856079
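
If you'd rather set the limit from inside the script, something like this before creating the pool should have a similar effect (just a sketch; cv2.setNumThreads(0) additionally disables OpenCV's internal threading, which you may or may not want here):

import cv2
import torch

torch.set_num_threads(1)  # cap PyTorch intra-op (OpenMP) threads, similar to OMP_NUM_THREADS=1
cv2.setNumThreads(0)      # optionally turn off OpenCV's own threading as well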

Hm, the problem I'm trying to solve is loading a batch in a multithreaded way, let's say reading 500 images. In a normal Dataset + DataLoader setup this happens sequentially (with optional prefetching), but I feel that reading those images could be improved this way, since it's an I/O-heavy task.

But as I dig into it more, the benefit might be less obvious, and ToTensor itself adds CPU overhead (the transpose).
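
One direction I'm considering is to keep only the I/O and decoding inside the thread pool and do the ToTensor-style conversion once, on the whole batch, in the main thread. Just a rough sketch of the idea (not benchmarked), reusing the dummy batch from the snippet above:

import cv2
import numpy as np
import torch
from concurrent.futures import ThreadPoolExecutor

batch_of_images = ['im.jpg'] * 128  # same dummy batch as in the snippet above

def load_image(path):
    # pure I/O + decode, no torch work inside the worker threads
    return cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)

with ThreadPoolExecutor(4) as tpe:
    images = list(tpe.map(load_image, batch_of_images))

# single batched conversion in the main thread:
# HWC uint8 [0, 255] -> NCHW float32 [0.0, 1.0]
batch = torch.from_numpy(np.stack(images)).permute(0, 3, 1, 2).contiguous().float().div_(255)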

Anyhow, thanks for taking the time to look into this, and if you have any suggestions, let me know.