Torchvision ToTensor MUCH slower than manually normalizing

I’ve been seeing VERY high CPU utilization when using ToTensor in my transforms, and it’s easily the slowest operation in my transform compositions.
torch: 1.8.1
torchvision: 0.9.1

I’ve been facing the same issue since 1.4.0, and even tried building from source (1.6.0 and 1.8.0).

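import numpy as np
import torch

def to_tensor(img):
    img = np.array(img)  # PIL image -> HWC uint8 array
    img = torch.from_numpy(img).float().permute(2, 0, 1)  # HWC -> CHW (a view, no copy)
    img = img / 255.0  # scale to [0, 1]
    return img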
In [12]: %%timeit
    ...: to_tensor(img)
826 µs ± 43.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

vs torchvision

    ...: tfunc.to_tensor(img)
1.49 ms ± 73.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

img is a 512x512 RGB PIL Image.
Is there any particular reason this may be happening?

Besides the additional checks in the torchvision implementation (numpy array inputs, number of channels, etc.), your version does not create a contiguous output, whereas TF.to_tensor does. You would therefore also see a performance penalty from the copy that is needed after the permute operation.
You can check it via:

out = to_tensor(img)
print(out.is_contiguous())
> False

Subsequent operations that require a contiguous input would then trigger the copy, so you would slow them down instead.
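For a fairer comparison, the manual version can pay for the copy up front as well. This is only a minimal sketch: a random uint8 array stands in for the 512x512 RGB image, and `to_tensor_contiguous` is a hypothetical helper name, not part of torchvision. Timings will vary by machine, so none are shown.

```python
import timeit

import numpy as np
import torch


def to_tensor(img):
    # Manual conversion from the question: HWC uint8 -> CHW float in [0, 1].
    img = np.array(img)
    img = torch.from_numpy(img).float().permute(2, 0, 1)
    return img / 255.0


def to_tensor_contiguous(img):
    # Hypothetical variant: force the memory copy immediately
    # instead of leaving it to a later operation.
    return to_tensor(img).contiguous()


# Assumption: random pixel data standing in for the 512x512 RGB PIL image.
img = np.random.randint(0, 256, size=(512, 512, 3), dtype=np.uint8)

lazy = to_tensor(img)
eager = to_tensor_contiguous(img)

print(lazy.is_contiguous())      # False
print(eager.is_contiguous())     # True
print(torch.equal(lazy, eager))  # True: same values, different memory layout

# Rough timing of both variants (seconds for 50 calls each).
t_lazy = timeit.timeit(lambda: to_tensor(img), number=50)
t_eager = timeit.timeit(lambda: to_tensor_contiguous(img), number=50)
print(f"non-contiguous: {t_lazy:.4f} s, contiguous: {t_eager:.4f} s")
```

If the downstream code never needs a contiguous tensor, the non-contiguous version may still be the cheaper choice overall; the point is only that the two timings in the question do not measure the same amount of work.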
