Bottleneck on "decode", "resize", and "read"


I’m training a regression model (output values between 0 and 100), and the inputs are images of plants. I’m using resnet18 from torchvision.

I noticed that GPU utilization peaks at about 40%, but most of the time it sits at 0%.
So I thought there was a bottleneck in the data loading/preprocessing steps, and used python3 -m torch.utils.bottleneck src/ to check.

Here are the results:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   121039  570.056    0.005  570.056    0.005 {method 'decode' of 'ImagingDecoder' objects}
    26068  180.834    0.007  180.834    0.007 {method 'resize' of 'ImagingCore' objects}
  1102194  104.741    0.000  104.741    0.000 {method 'read' of '_io.BufferedReader' objects}
    26068   14.408    0.001   14.408    0.001 {built-in method}
    28838    7.336    0.000    7.336    0.000 {method 'to' of 'torch._C._TensorBase' objects}
     8160    5.703    0.001    5.703    0.001 {built-in method torch.conv2d}
    26071    4.836    0.000    4.836    0.000 {built-in method}
    66454    4.492    0.000    4.492    0.000 {method 'item' of 'torch._C._TensorBase' objects}
      816    4.220    0.005    4.220    0.005 {built-in method torch.stack}
    26068    2.800    0.000    2.800    0.000 {method 'contiguous' of 'torch._C._TensorBase' objects}
    26068    2.656    0.000    2.656    0.000 {method 'close' of '_io.BufferedReader' objects}
      326    2.234    0.007    2.234    0.007 {method 'run_backward' of 'torch._C._EngineBase' objects}
    26068    2.161    0.000    2.161    0.000 {method 'div' of 'torch._C._TensorBase' objects}
    52136    1.247    0.000  590.483    0.011 /home/igorf/.conda/envs/my-env/lib/python3.8/site-packages/PIL/
    52136    0.999    0.000    4.623    0.000 /home/igorf/.conda/envs/my-env/lib/python3.8/site-packages/pandas/core/internals/

The “cumtime” shows high values for decode, resize, and read, but I don’t know where this decode is in my code. For resize I suppose it comes from torchvision's Resize() and for read it must be from PIL.
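
To put these numbers in proportion, here is a small sketch that sums the tottime column of the rows listed above and computes the share spent in decode, resize, and read. The values are copied from the table; grouping those three rows as "image loading" is an assumption, and only the rows shown are counted, not the full profile.

```python
# tottime in seconds, copied from the profiler rows above
image_loading = {"decode": 570.056, "resize": 180.834, "read": 104.741}
# Remaining listed rows (tensor ops, conv2d, backward, pandas, ...)
other_rows = [14.408, 7.336, 5.703, 4.836, 4.492, 4.220, 2.800,
              2.656, 2.234, 2.161, 1.247, 0.999]

loading_total = sum(image_loading.values())
listed_total = loading_total + sum(other_rows)
loading_share = loading_total / listed_total

# Share of the listed time per image-loading step
for name, seconds in image_loading.items():
    print(f"{name:>6}: {seconds / listed_total:6.1%}")
print(f"image loading overall: {loading_share:.1%}")
```

By this count, roughly 94% of the listed time goes to reading, decoding, and resizing images on the CPU. Note that these are CPU-side timings, so the GPU entries mostly reflect the time to launch the operations.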

I would like to understand how I can make my model run faster from this output.

My dataset is as follows:

from PIL import Image

from torch.utils.data import Dataset
from torchvision import transforms

class CGIARDataset(Dataset):
    def __init__(self, df, transform=None):
        self.df = df
        self.transform = transform

    def __len__(self):
        return len(self.df)
    def __getitem__(self, idx):
        y = self.df.iloc[idx]['extent']
        img = Image.open(self.df.iloc[idx]['filename'])
        x = transforms.ToTensor()(img)
        if self.transform is not None:
            x = self.transform(x)
        return x, y

As you can see, I’m reading the image using PIL, and converting to tensor using ToTensor from torchvision.

The resize step is in my transform object:

transform = transforms.Compose([
    transforms.Resize(IMG_SIZE, antialias=True)
])

Could anyone give me some tips on this?
For example, should I change my image reading from PIL to another library?
Or where does the decode come from?


If multiple workers in the DataLoader don’t help, you could replace PIL with PIL-SIMD (Pillow-SIMD), which could accelerate the decoding and transformation steps.
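
For reference, a minimal sketch of the multi-worker setup, assuming a map-style dataset like the one above. The helper name make_loader and the batch_size and num_workers values are placeholders, not tuned recommendations; a reasonable starting point for num_workers is the number of CPU cores available to the job.

```python
import torch
from torch.utils.data import DataLoader


def make_loader(dataset, batch_size=32, num_workers=4):
    """Build a DataLoader that decodes and resizes images in worker processes."""
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,               # parallel CPU workers for loading
        pin_memory=torch.cuda.is_available(),  # faster host-to-GPU copies
        persistent_workers=num_workers > 0,    # keep workers alive across epochs
    )

# Usage (hypothetical names): loader = make_loader(CGIARDataset(df, transform))
```

With num_workers=0 all decoding happens in the main process, which is the setup where the GPU waits on the CPU the most.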


Thanks for your reply!

I installed PIL-SIMD as recommended here but I got the following error when running my training script:

Traceback (most recent call last):
  File "src/", line 9, in <module>
    from torchvision import transforms
  File "/home/igorf/.conda/envs/my-env/lib/python3.8/site-packages/torchvision/", line 6, in <module>
    from torchvision import datasets, io, models, ops, transforms, utils
  File "/home/igorf/.conda/envs/my-env/lib/python3.8/site-packages/torchvision/datasets/", line 1, in <module>
    from ._optical_flow import FlyingChairs, FlyingThings3D, HD1K, KittiFlow, Sintel
  File "/home/igorf/.conda/envs/my-env/lib/python3.8/site-packages/torchvision/datasets/", line 10, in <module>
    from PIL import Image
  File "/home/igorf/.conda/envs/my-env/lib/python3.8/site-packages/PIL/", line 89, in <module>
    from . import _imaging as core
ImportError: /home/igorf/.conda/envs/my-env/lib/python3.8/site-packages/PIL/ undefined symbol: PyObject_CheckBuffer

Do you know what the problem is?

I’m using an HPC cluster (CentOS 7).

No, unfortunately I haven’t seen this error before and don’t know what might be causing it. Note that I’m using PIL-SIMD myself and didn’t run into any issues.

Did you install with the “AVX2-enabled version” as stated here?

Yes, I pass CC="cc -mavx" to the build command.

I also passed the flag to install the AVX2-enabled version, but it didn’t work.

To give you some updates:
I managed to solve the problem by resizing the images once ahead of time and saving them to disk using PIL.

Now the training script is reading directly from a folder with resized images and it’s way faster: before it was ~4min per epoch, and now it’s roughly 1min, which helped a lot to experiment with new ideas.
Moreover, the GPU utilization is good now, so I think the resizing step was the bottleneck in my training.
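
For future readers, here is a minimal sketch of this kind of one-off preprocessing. It is an illustration, not the exact script used here: the function name resize_folder, the folder arguments, the fixed (224, 224) target size, and the bilinear filter are all assumptions. Note that it forces a fixed width and height, whereas Resize with a single integer keeps the aspect ratio.

```python
from pathlib import Path

from PIL import Image


def resize_folder(src_dir, dst_dir, size=(224, 224)):
    """Resize every image in src_dir once and save the copies to dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for src in sorted(src_dir.iterdir()):
        # Skip anything that is not a common image file
        if src.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
            continue
        with Image.open(src) as img:
            # The file is decoded here, once, instead of on every epoch
            resized = img.convert("RGB").resize(size, Image.BILINEAR)
            resized.save(dst_dir / src.name)
        count += 1
    return count
```

The training dataset then points at the output folder, so each epoch only decodes small files and no longer pays for the resize.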

Thanks for your help and I also hope this helps future readers.

