Bottleneck on "decode", "resize", and "read"

Hello.

I’m training a regression task (output values between 0 and 100), and the inputs are images of plants. I’m using resnet18 from torchvision.

I noticed that the GPU peaks at around 40% utilization at most, but most of the time it sits at 0%.
So I suspected a bottleneck in the data loading/preprocessing steps and ran python3 -m torch.utils.bottleneck src/train.py to check.

Here are the results:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   121039  570.056    0.005  570.056    0.005 {method 'decode' of 'ImagingDecoder' objects}
    26068  180.834    0.007  180.834    0.007 {method 'resize' of 'ImagingCore' objects}
  1102194  104.741    0.000  104.741    0.000 {method 'read' of '_io.BufferedReader' objects}
    26068   14.408    0.001   14.408    0.001 {built-in method PIL._imaging.new}
    28838    7.336    0.000    7.336    0.000 {method 'to' of 'torch._C._TensorBase' objects}
     8160    5.703    0.001    5.703    0.001 {built-in method torch.conv2d}
    26071    4.836    0.000    4.836    0.000 {built-in method io.open}
    66454    4.492    0.000    4.492    0.000 {method 'item' of 'torch._C._TensorBase' objects}
      816    4.220    0.005    4.220    0.005 {built-in method torch.stack}
    26068    2.800    0.000    2.800    0.000 {method 'contiguous' of 'torch._C._TensorBase' objects}
    26068    2.656    0.000    2.656    0.000 {method 'close' of '_io.BufferedReader' objects}
      326    2.234    0.007    2.234    0.007 {method 'run_backward' of 'torch._C._EngineBase' objects}
    26068    2.161    0.000    2.161    0.000 {method 'div' of 'torch._C._TensorBase' objects}
    52136    1.247    0.000  590.483    0.011 /home/igorf/.conda/envs/my-env/lib/python3.8/site-packages/PIL/ImageFile.py:155(load)
    52136    0.999    0.000    4.623    0.000 /home/igorf/.conda/envs/my-env/lib/python3.8/site-packages/pandas/core/internals/managers.py:1027(fast_xs)

The “cumtime” column shows high values for decode, resize, and read, but I don’t know where this decode happens in my code. The resize presumably comes from torchvision's Resize(), and the read must come from PIL.

From this output, I would like to understand how I can make my training run faster.

My dataset is as follows:

from PIL import Image

from torch.utils.data import Dataset
from torchvision import transforms


class CGIARDataset(Dataset):
    def __init__(self, df, transform=None):
        self.df = df
        self.transform = transform

    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        y = self.df.iloc[idx]['extent']
        img = Image.open(self.df.iloc[idx]['filename'])
        x = transforms.ToTensor()(img)
        if self.transform is not None:
            x = self.transform(x)
        return x, y

As you can see, I’m reading the image using PIL, and converting to tensor using ToTensor from torchvision.

The resize step is in my transform object:

transform = transforms.Compose([
    transforms.Resize(IMG_SIZE, antialias=True)
])

Could anyone give me some tips on this?
For example, should I switch from PIL to another library for reading the images?
And where does this decode call come from?

Thanks!

If multiple workers in the DataLoader don’t help, you could replace PIL with PIL-SIMD, which could accelerate the decoding and transformation steps.
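
For reference, here is a minimal sketch of what I mean by multiple workers. CGIARDataset, df, and transform are the objects from your post above; the batch size and worker count are placeholder values you would tune for your machine:

from torch.utils.data import DataLoader

# Sketch only: CGIARDataset, df, and transform come from the post above;
# batch_size and num_workers are placeholders to tune for your hardware.
train_dataset = CGIARDataset(df, transform=transform)
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,            # decode + resize run in parallel worker processes
    pin_memory=True,          # speeds up host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs (PyTorch >= 1.7)
)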


Thanks for your reply!

I installed PIL-SIMD as recommended here, but I got the following error when running my training script:

Traceback (most recent call last):
  File "src/train_coral.py", line 9, in <module>
    from torchvision import transforms
  File "/home/igorf/.conda/envs/my-env/lib/python3.8/site-packages/torchvision/__init__.py", line 6, in <module>
    from torchvision import datasets, io, models, ops, transforms, utils
  File "/home/igorf/.conda/envs/my-env/lib/python3.8/site-packages/torchvision/datasets/__init__.py", line 1, in <module>
    from ._optical_flow import FlyingChairs, FlyingThings3D, HD1K, KittiFlow, Sintel
  File "/home/igorf/.conda/envs/my-env/lib/python3.8/site-packages/torchvision/datasets/_optical_flow.py", line 10, in <module>
    from PIL import Image
  File "/home/igorf/.conda/envs/my-env/lib/python3.8/site-packages/PIL/Image.py", line 89, in <module>
    from . import _imaging as core
ImportError: /home/igorf/.conda/envs/my-env/lib/python3.8/site-packages/PIL/_imaging.cpython-38-x86_64-linux-gnu.so: undefined symbol: PyObject_CheckBuffer

Do you know what the problem is?

I’m using an HPC cluster (CentOS 7).

No, unfortunately I haven’t seen this error before and don’t know what might be causing it. Note that I’m using PIL-SIMD myself and didn’t run into any issues.

Did you install with the “AVX2-enabled version” as stated here?

Yes, I passed CC="cc -mavx" to the build command.

I also tried the flag for the AVX2-enabled version, but it didn’t work either.

To give you an update:
I managed to solve the problem by resizing the images offline and saving them to disk with Image.open(path).resize(IMG_SIZE) from PIL.
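
In case it helps future readers, the offline pass was essentially the following sketch (the paths and the IMG_SIZE value are placeholders for my setup):

import os

from PIL import Image

IMG_SIZE = (224, 224)               # placeholder; use whatever your model expects
src_dir = "data/images"             # placeholder input folder
dst_dir = "data/images_resized"     # placeholder output folder
os.makedirs(dst_dir, exist_ok=True)

for name in os.listdir(src_dir):
    with Image.open(os.path.join(src_dir, name)) as img:
        # decode + resize once, offline, so training only reads small files
        img.resize(IMG_SIZE).save(os.path.join(dst_dir, name))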

Now the training script reads directly from the folder of resized images and it’s way faster: before it was ~4 min per epoch, and now it’s roughly 1 min, which helps a lot when experimenting with new ideas.
Moreover, the GPU utilization is good now, so I think the resizing step was what was holding back my training.

Thanks for your help and I also hope this helps future readers.


I know this thread is a bit old now, but the core issue the OP hit back then is still the same today: the CPU decode/resize step is usually the real bottleneck. torchvision.transforms.v2 helps a lot by letting many ops (color jitter, normalize, etc.) run on the GPU, but there is still overhead because each op ends up as a separate GPU kernel launch.
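
For context, the separate-kernel baseline I mean is a GPU-side transforms.v2 pipeline along these lines (the ops and values are just an example, and this assumes a recent torchvision with the v2 API):

import torch
from torchvision.transforms import v2

# Each transform below launches its own CUDA kernel(s) per batch.
gpu_augment = v2.Compose([
    v2.RandomResizedCrop(224, antialias=True),
    v2.RandomHorizontalFlip(p=0.5),
    v2.ToDtype(torch.float32, scale=True),   # uint8 [0, 255] -> float32 [0, 1]
    v2.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    v2.Normalize(mean=(0.4914, 0.4822, 0.4465), std=(0.2470, 0.2435, 0.2616)),
])

# A batch that has already been decoded/resized on the CPU and moved to the GPU.
images = torch.randint(0, 256, (32, 3, 256, 256), dtype=torch.uint8, device="cuda")
images = gpu_augment(images)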

If you have already optimized the CPU side (TurboJPEG, WebDataset, DALI, or just pre-resizing offline as the OP did), there is still a bit more speed you can squeeze out after the resize step.

I’ve been working on a small Triton-based library that fuses common torchvision.transforms.v2 image augmentation ops (crop, flip, brightness/contrast/saturation, grayscale, normalize) into a single GPU kernel.

It doesn’t fix decode/resize, but once your batch is already on the GPU, the fused kernel is usually 5–12× faster than torchvision v2’s separate kernels, especially for larger images.

Super easy drop-in:

augment = ta.TritonFusedAugment(
    crop_size=224,
    horizontal_flip_p=0.5,
    brightness=0.2, contrast=0.2, saturation=0.2,
    mean=(0.4914, 0.4822, 0.4465),
    std=(0.2470, 0.2435, 0.2616),
    same_on_batch=False  # Each image gets different random params (default)
)

for images, labels in train_loader:
    images, labels = images.cuda(), labels.cuda()
    images = augment(images)  # All ops in 1 kernel per batch! 🚀
    ...

Might help anyone who already solved the CPU bottleneck but still sees GPU-side augmentation show up in the profiler.