Faster Transforms (Precompiled)

Hi all,

I spent some time tracking down the biggest bottleneck in the training phase, which turned out to be the transforms on the input images. 500-3000 tiles need to be transformed for every iteration using the Compose below, which takes 5-20 seconds. I tried a variety of Python tricks to speed things up (pre-allocating lists, generators, chunking), to no avail. I already use multiple workers. The issue seems to be that each individual transform takes some time, and it really adds up:

  self.img_finalize = transforms.Compose([
            transforms.ToPILImage(),  # 1 s
            transforms.RandomRotation(25),  # 2 s
            transforms.RandomResizedCrop(1024, scale=(0.5, 1.5), ratio=(0.8, 1.2), interpolation=2),  # 0.5 s
            transforms.Resize(resolution),  # 0.5 s
            transforms.RandomHorizontalFlip(p=0.5),  # 0.5 s
            transforms.RandomVerticalFlip(p=0.5),  # 0.5 s
            transforms.ColorJitter(brightness=0.2, contrast=0.1, saturation=0.05, hue=0.02),  # 3 s
            transforms.ToTensor(),  # 2 s
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # 3 s
  ])
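A quick way to confirm those per-transform timings is to wrap each stage in a small timer. This is a self-contained sketch with stand-in functions in place of the real torchvision transforms; the `Timed` class and the stage names are illustrative, not part of the original pipeline.

```python
import time

class Timed:
    """Wraps a callable and records how long each call takes.

    Drop-in for any stage of a Compose-style pipeline; the pipeline
    itself is simulated here with plain functions so the sketch is
    self-contained.
    """
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn
        self.total = 0.0
        self.calls = 0

    def __call__(self, x):
        start = time.perf_counter()
        out = self.fn(x)
        self.total += time.perf_counter() - start
        self.calls += 1
        return out

# Stand-ins for the real transforms (assumptions, not the actual pipeline).
stages = [
    Timed("rotate", lambda x: x + 1),
    Timed("crop",   lambda x: x * 2),
]

def pipeline(x):
    for stage in stages:
        x = stage(x)
    return x

result = pipeline(3)  # (3 + 1) * 2 = 8
report = {s.name: s.calls for s in stages}
```

Printing `s.name, s.total / s.calls` after an epoch gives a per-stage average without any external profiler.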

Two things:

  1. I notice that with the DataLoader, every worker that was started needs to finish before any data is pushed to the GPU. If I print DONE at the end of the __getitem__() method and FORWARD just before the forward pass, there is a significant delay between the first DONE and the first FORWARD; in fact, there need to be 16 (= n_workers) DONEs first.

This doesn’t happen if n_workers=None. I suspect some form of lock is implemented? I am working in the unusual situation where my batch size is 1, but each batch is huge (it’s a MIL paradigm…). So it would be useful to process a batch as soon as its __getitem__() method is done. This makes sense to me if batch_size > 1, since you need to be sure all samples are done before collating, but it would be great if it could be relaxed when batch_size = 1.

E.g. I see output like this, and I don’t think it’s the latency of sending to the GPU (I checked that too; it’s much faster than the individual methods). It might instead be the latency of sending data from the forked worker process back to the main one?

  2. Is there any use in pre-compiling the transform? If there would be significant performance gains (and it’s possible), it’s something I definitely want to explore. Any advice or helpful links on this? I have some experience with exposing C++ to Python.



Hi there.
Each worker processes a whole batch, so you need plenty of CPUs for it to be efficient.
Note that pre-computing the preprocessing offline leads to overfitting unless you regenerate it in parallel with the training.

Btw, try the pillow-simd library and let me know if you get a good improvement.

Lastly, the “proper” way of knowing whether there is a bottleneck is checking GPU utilization. If it’s always >95% you are OK. If utilization fluctuates, you should check what you can do about it.

Thanks for the reply.

I know that each worker processes the data defined in the __getitem__() method; the issue was twofold: the sheer time, and the locking behaviour of the DataLoader.

I also know the bottleneck is this step because I profiled the run with cProfile: nearly all the compute time is spent in DataLoader methods. Running the script with predefined tensors (putting the full weight on the GPU side) takes about 20 seconds per epoch, while with the transforms it takes about 4 minutes.
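For reference, a minimal cProfile pattern like the following can confirm where the time goes; `getitem_stub` here is a hypothetical stand-in for one `__getitem__()` call, not the actual dataset code.

```python
import cProfile
import io
import pstats

# Hypothetical stand-in for one __getitem__ call; replace with
# dataset[0] (or a full epoch loop) in the real script.
def getitem_stub():
    return sum(i * i for i in range(10_000))

profiler = cProfile.Profile()
profiler.enable()
value = getitem_stub()
profiler.disable()

# Print the hottest functions sorted by cumulative time.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
```

Sorting by `"tottime"` instead shows where time is spent excluding callees, which separates the Python wrappers from the C-level work.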

I gather the DataLoader transforms already use PIL methods under the hood; can this pillow-simd simply be swapped in?


The problem is there is not much you can do about it. Have you checked that it’s not an I/O problem?
PILLOW-SIMD is a drop-in replacement, fully compatible with PIL.

BTW, after seeing the line transforms.RandomResizedCrop(1024 I realized you are working with high-resolution images. That’s probably why it takes ages.
Another thing you can consider is implementing your own transforms for tensors. That way you can precompute the normalization.
You can also crop and resize first, so the rotation has fewer pixels to process.
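The crop-first suggestion can be quantified with rough arithmetic: interpolation-based transforms scale with the pixel count, so shrinking the image before rotating shrinks the work roughly proportionally. The sizes below are illustrative assumptions, not values taken from the post.

```python
# Back-of-the-envelope: rotation cost scales with pixel count, so
# resizing/cropping before rotating reduces the work. Sizes are
# illustrative assumptions.
src_side = 3000      # hypothetical source tile side, in pixels
dst_side = 1024      # target side after RandomResizedCrop/Resize

pixels_before = src_side ** 2   # pixels rotated if rotation comes first
pixels_after = dst_side ** 2    # pixels rotated if crop/resize comes first
speedup = pixels_before / pixels_after
```

Under these assumed sizes the rotation touches about 8.6x fewer pixels when it runs after the crop.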


Yes, it is huge – the input size to the network alone is 2 GB :grimacing:

Everything is fine technically. I was mostly curious if a pure C++ implementation would actually be faster/possible.

I don’t know much about PIL’s underlying code, but I assume it’s not very simple.
If you have enough disk space, I would recommend using a numpy memory map to load the data. If you save already-normalized data, you would save plenty of time, namely: 1 s ToPILImage, 2 s ToTensor, 3 s Normalize.
And probably another second on flipping.

Drawback: you have to implement rotation and ColorJitter yourself.
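A minimal sketch of the memmap idea, assuming the dataset fits on disk as a single float32 array; the shape, the temp-file location, and the [-1, 1] normalization are made up for the demo.

```python
import os
import tempfile

import numpy as np

# Sketch: write normalized float32 data to disk once, then memory-map it
# at training time so ToPILImage/ToTensor/Normalize are no longer needed.
shape = (4, 3, 8, 8)  # (n_tiles, channels, H, W), tiny for the demo
path = os.path.join(tempfile.mkdtemp(), "tiles.dat")

# One-off preprocessing pass: normalize to [-1, 1] and persist.
raw = np.random.randint(0, 256, size=shape).astype(np.float32)
normalized = (raw / 255.0 - 0.5) / 0.5
mm = np.memmap(path, dtype=np.float32, mode="w+", shape=shape)
mm[:] = normalized
mm.flush()
del mm

# Training time: pages are loaded lazily from disk on access.
tiles = np.memmap(path, dtype=np.float32, mode="r", shape=shape)
first = np.array(tiles[0])  # copy a single tile out of the map
```

Indexing the map inside `__getitem__()` then reads only the pages for that tile, rather than the whole file.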


The transforms are all implemented in C under the hood. The PyTorch vision transform functions are just wrappers around the PIL (pillow) library, and the PIL operations are implemented in C. It’s unlikely (but possible) that the overhead of the Python wrapper pieces is the bottleneck.

As @JuanFMontesinos wrote, pillow-simd is faster than pillow. The accimage library is probably even faster, but only supports a few transforms.

If you’re trying to optimize data loading, you should simplify your program as much as possible and measure independent pieces. You’ve already ruled out the GPU part of the network, so temporarily get rid of the DataLoader and multiprocessing, which add complexity, and:

  1. Measure file loading times
  2. Measure the transform times on a single tile
  3. Measure the transform time on a composite (?) image
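Step 1 in the list above can be isolated with nothing but the standard library; the snippet below times a raw read of a synthetic 1 MiB file standing in for a tile.

```python
import os
import tempfile
import time

# Synthetic stand-in for one tile on disk (1 MiB of random bytes).
path = os.path.join(tempfile.mkdtemp(), "tile.bin")
with open(path, "wb") as f:
    f.write(os.urandom(1 << 20))

# Time the raw file read, with no decoding or transforms involved.
start = time.perf_counter()
with open(path, "rb") as f:
    data = f.read()
load_seconds = time.perf_counter() - start
```

If this number is already large relative to the per-tile budget, the bottleneck is I/O rather than the transforms.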

If you’re on Linux, use perf top to figure out if the bulk of the time is spent in Python or in the native C code.

From what you wrote, it seems likely that the transforms are just slow and your images are huge. There’s a good chance that some of your data augmentations are not very helpful for your task – try getting rid of some. For example, my experience with ColorJitter is that it had no effect for training ResNet classifiers on ImageNet. I’m also confused by the RandomResizedCrop followed by the Resize – that seems redundant?


@colesbury and @JuanFMontesinos

Thanks for all your input. It was worth the ask, but it seems clear that for now I’m stuck with how things are. I will look into pillow-simd, though, and accimage. The unfortunate part is that a for loop is likely required, as no batched transformations exist (yet).

In case you’re curious, the transformations are meant to recreate real and plausible changes in the input data. ColorJitter is useful because these are stained tissue images, so it replicates different laboratory/scanner settings. RandomResizedCrop effectively changes the focal properties, and Resize is needed to standardize the input resolution.

Thanks for all the input, my questions/hopes are answered!

I know this isn’t precisely what you asked about but to make your training phase faster you can always prepare your data ahead of time. So, for example, instead of running transforms for every iteration, just run them once on your entire dataset offline and then train with the resulting set. Yes, you will lose the randomness that you get with every single iteration, but in my experience it doesn’t improve things by a huge margin if you are smart about how you generate your training set offline.

You can even do a bit better than that and allow for some randomness by running a few fast transforms (resizes, flips etc) during each iteration on your otherwise offline-generated training set. That’s the approach I’m using and it seems to be a good middle ground.
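That middle ground can be sketched in plain Python: the heavy transforms run once, offline, and only a cheap random op (here, a horizontal flip) stays in the per-iteration path. The images are nested lists and both transforms are illustrative stand-ins, not the real pipeline.

```python
import random

# Heavy work done once, offline: stand-in for rotation/color jitter/
# normalization (here, just scaling pixel values to [0, 1]).
def heavy_transform(img):
    return [[px / 255.0 for px in row] for row in img]

# Cheap per-iteration augmentation: a random horizontal flip.
def light_random_flip(img, rng):
    if rng.random() < 0.5:
        return [list(reversed(row)) for row in img]
    return img

dataset = [[[0, 128], [255, 64]]]                     # one tiny 2x2 "image"
offline = [heavy_transform(img) for img in dataset]   # precomputed once

rng = random.Random(0)
batch = [light_random_flip(img, rng) for img in offline]  # per iteration
```

Only the flip runs inside the training loop; everything expensive was paid for once when `offline` was built.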

The PyTorch transforms can consume both pillow and tensor objects. Does this imply I should stay with pillow objects for as long as possible, so that I can benefit from pillow-simd?