Bottleneck on "decode", "resize", and "read"

Hello.

I’m training a regression task (output values between 0 and 100), and the inputs are images of plants. I’m using resnet18 from torchvision.

I noticed that the GPU peaks at around 40% utilization at most, but most of the time it sits at 0%.
So I suspected a bottleneck in the data loading/preprocessing steps and ran python3 -m torch.utils.bottleneck src/train.py to check.

Here are the results:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   121039  570.056    0.005  570.056    0.005 {method 'decode' of 'ImagingDecoder' objects}
    26068  180.834    0.007  180.834    0.007 {method 'resize' of 'ImagingCore' objects}
  1102194  104.741    0.000  104.741    0.000 {method 'read' of '_io.BufferedReader' objects}
    26068   14.408    0.001   14.408    0.001 {built-in method PIL._imaging.new}
    28838    7.336    0.000    7.336    0.000 {method 'to' of 'torch._C._TensorBase' objects}
     8160    5.703    0.001    5.703    0.001 {built-in method torch.conv2d}
    26071    4.836    0.000    4.836    0.000 {built-in method io.open}
    66454    4.492    0.000    4.492    0.000 {method 'item' of 'torch._C._TensorBase' objects}
      816    4.220    0.005    4.220    0.005 {built-in method torch.stack}
    26068    2.800    0.000    2.800    0.000 {method 'contiguous' of 'torch._C._TensorBase' objects}
    26068    2.656    0.000    2.656    0.000 {method 'close' of '_io.BufferedReader' objects}
      326    2.234    0.007    2.234    0.007 {method 'run_backward' of 'torch._C._EngineBase' objects}
    26068    2.161    0.000    2.161    0.000 {method 'div' of 'torch._C._TensorBase' objects}
    52136    1.247    0.000  590.483    0.011 /home/igorf/.conda/envs/my-env/lib/python3.8/site-packages/PIL/ImageFile.py:155(load)
    52136    0.999    0.000    4.623    0.000 /home/igorf/.conda/envs/my-env/lib/python3.8/site-packages/pandas/core/internals/managers.py:1027(fast_xs)

The “cumtime” column shows high values for decode, resize, and read, but I don’t know where this decode happens in my code. The resize presumably comes from torchvision's Resize(), and the read must come from PIL.

From this output, I would like to understand how I can make my training run faster.

My dataset is as follows:

from PIL import Image

from torch.utils.data import Dataset
from torchvision import transforms


class CGIARDataset(Dataset):
    def __init__(self, df, transform=None):
        self.df = df
        self.transform = transform

    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        y = self.df.iloc[idx]['extent']
        img = Image.open(self.df.iloc[idx]['filename'])
        x = transforms.ToTensor()(img)
        if self.transform is not None:
            x = self.transform(x)
        return x, y

As you can see, I’m reading the image using PIL, and converting to tensor using ToTensor from torchvision.

The resize step is in my transform object:

transform = transforms.Compose([
    transforms.Resize(IMG_SIZE, antialias=True)
])

Could anyone give me some tips on this?
For example, should I switch from PIL to another library for reading the images?
And where does this decode call come from?

Thanks!

If multiple workers in the DataLoader don’t help, you could replace PIL with PIL-SIMD, which could accelerate the decoding and transformation steps.
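
For reference, here is a minimal sketch of what I mean by multiple workers. CGIARDataset, df, and transform are the objects from your post above; the batch size and worker count are placeholder values you would tune for your machine:

from torch.utils.data import DataLoader

# Sketch only: CGIARDataset, df, and transform come from the post above;
# batch_size and num_workers are placeholders to tune for your hardware.
train_dataset = CGIARDataset(df, transform=transform)
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,            # decode + resize run in parallel worker processes
    pin_memory=True,          # speeds up host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs (PyTorch >= 1.7)
)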


Thanks for your reply!

I installed PIL-SIMD as recommended here, but I got the following error when running my training script:

Traceback (most recent call last):
  File "src/train_coral.py", line 9, in <module>
    from torchvision import transforms
  File "/home/igorf/.conda/envs/my-env/lib/python3.8/site-packages/torchvision/__init__.py", line 6, in <module>
    from torchvision import datasets, io, models, ops, transforms, utils
  File "/home/igorf/.conda/envs/my-env/lib/python3.8/site-packages/torchvision/datasets/__init__.py", line 1, in <module>
    from ._optical_flow import FlyingChairs, FlyingThings3D, HD1K, KittiFlow, Sintel
  File "/home/igorf/.conda/envs/my-env/lib/python3.8/site-packages/torchvision/datasets/_optical_flow.py", line 10, in <module>
    from PIL import Image
  File "/home/igorf/.conda/envs/my-env/lib/python3.8/site-packages/PIL/Image.py", line 89, in <module>
    from . import _imaging as core
ImportError: /home/igorf/.conda/envs/my-env/lib/python3.8/site-packages/PIL/_imaging.cpython-38-x86_64-linux-gnu.so: undefined symbol: PyObject_CheckBuffer

Do you know what the problem is?

I’m using an HPC cluster (CentOS 7).

No, unfortunately I haven’t seen this error before and don’t know what might be causing it. Note that I’m using PIL-SIMD myself and didn’t run into any issues.

Did you install with the “AVX2-enabled version” as stated here?

Yes, I passed CC="cc -mavx" to the build command.

I also tried the flag for the AVX2-enabled version, but it didn’t work either.

To give you an update:
I managed to solve the problem by resizing the images offline and saving them to disk with Image.open(path).resize(IMG_SIZE) from PIL.
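
In case it helps future readers, the offline pass was essentially the following sketch (the paths and the IMG_SIZE value are placeholders for my setup):

import os

from PIL import Image

IMG_SIZE = (224, 224)               # placeholder; use whatever your model expects
src_dir = "data/images"             # placeholder input folder
dst_dir = "data/images_resized"     # placeholder output folder
os.makedirs(dst_dir, exist_ok=True)

for name in os.listdir(src_dir):
    with Image.open(os.path.join(src_dir, name)) as img:
        # decode + resize once, offline, so training only reads small files
        img.resize(IMG_SIZE).save(os.path.join(dst_dir, name))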

Now the training script reads directly from the folder of resized images and it’s way faster: before it was ~4 min per epoch, and now it’s roughly 1 min, which helps a lot when experimenting with new ideas.
Moreover, the GPU utilization is good now, so I think the resizing step was what was holding back my training.

Thanks for your help and I also hope this helps future readers.


I know this thread is a bit old now, but the core issue the OP hit back then is still the same today: the CPU decode/resize step is usually the real bottleneck. torchvision.transforms.v2 helps a lot by letting many ops (color jitter, normalize, etc.) run on the GPU, but there is still overhead because each op ends up as a separate GPU kernel launch.
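
For context, the separate-kernel baseline I mean is a GPU-side transforms.v2 pipeline along these lines (the ops and values are just an example, and this assumes a recent torchvision with the v2 API):

import torch
from torchvision.transforms import v2

# Each transform below launches its own CUDA kernel(s) per batch.
gpu_augment = v2.Compose([
    v2.RandomResizedCrop(224, antialias=True),
    v2.RandomHorizontalFlip(p=0.5),
    v2.ToDtype(torch.float32, scale=True),   # uint8 [0, 255] -> float32 [0, 1]
    v2.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    v2.Normalize(mean=(0.4914, 0.4822, 0.4465), std=(0.2470, 0.2435, 0.2616)),
])

# A batch that has already been decoded/resized on the CPU and moved to the GPU.
images = torch.randint(0, 256, (32, 3, 256, 256), dtype=torch.uint8, device="cuda")
images = gpu_augment(images)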

If you have already optimized the CPU side (TurboJPEG, WebDataset, DALI, or just pre-resizing offline as the OP did), there is still a bit more speed you can squeeze out after the resize step.

I’ve been working on a small Triton-based library that fuses common torchvision.transforms.v2 image augmentation ops (crop, flip, brightness/contrast/saturation, grayscale, normalize) into a single GPU kernel.

It doesn’t fix decode/resize, but once your batch is already on the GPU, the fused kernel is usually 5–12× faster than torchvision v2’s separate kernels, especially for larger images.

Super easy drop-in:

augment = ta.TritonFusedAugment(
    crop_size=224,
    horizontal_flip_p=0.5,
    brightness=0.2, contrast=0.2, saturation=0.2,
    mean=(0.4914, 0.4822, 0.4465),
    std=(0.2470, 0.2435, 0.2616),
    same_on_batch=False  # Each image gets different random params (default)
)

for images, labels in train_loader:
    images, labels = images.cuda(), labels.cuda()
    images = augment(images)  # All ops in 1 kernel per batch! 🚀
    ...

Might help anyone who already solved the CPU bottleneck but still sees GPU-side augmentation show up in the profiler.