I was wondering whether, given the high-resolution images in this large dataset, performing a resize step every time in the DataLoader would hurt my performance, compared to resizing the entire dataset beforehand.
I think it generally depends on how you are loading the data.
Often you would be working with a large dataset, which either won’t fit entirely into RAM or which would be wasteful to pre-load completely (long startup time; slow experiment iterations).
In this case you won’t be able to resize the entire dataset at once unless you store each image/sample using the new resized shape.
However, if you are dealing with a smaller dataset and are already preloading it, you might see a speedup from resizing beforehand, as it would reduce the per-sample transformation workload. The expected speedup again depends on the actual use case, so you would need to measure it.
Also, note that data augmentation approaches (such as random crops before resizing, etc.) would no longer be possible, since your resize operation would be “static”.
But I can still use a library such as PIL to resize this large high-res dataset offline, and then apply data augmentation transforms during training, right? That’s what I mean: if I can skip that one extra step and just do it once beforehand, maybe there’s an advantage. But I’m not sure whether it really matters, since the DataLoader uses workers to speed up loading. Again, I might be wrong here.
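Something like this one-off offline pass is what I have in mind (directory layout, file extension, and target size are just placeholders):

```python
from pathlib import Path
from PIL import Image

def resize_dataset(src_dir: str, dst_dir: str, size=(224, 224)) -> None:
    """Resize every .jpg in src_dir once and save it to dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*.jpg"):
        with Image.open(path) as img:
            # Bilinear is a reasonable default; pick whatever matches
            # the interpolation you would otherwise use at training time.
            img.resize(size, Image.BILINEAR).save(dst / path.name)
```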
Yes, assuming you would also resize the samples to a static shape during training and would not use a random transformation before.
I would recommend profiling both use cases to check whether you would see a speedup. E.g. if your data loading pipeline is already fast enough to completely hide the latency, by pre-loading the samples (using the larger images) in the background while the GPU is busy training the model, optimizing this pipeline would not yield any speedup.
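A rough way to profile just the loading side is to time a full pass over a synthetic dataset with and without a per-sample resize (all sizes, the sample count, and the worker count below are assumptions; only a measurement on your real data is conclusive):

```python
import time
import torch
from torch.utils.data import DataLoader, Dataset

class SyntheticImages(Dataset):
    """Yields random 'high-res' tensors, optionally resized per sample."""
    def __init__(self, n=64, resize=True):
        self.n = n
        self.resize = resize

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        if self.resize:
            # Simulate loading a large image and resizing it on the fly.
            x = torch.rand(3, 1024, 1024)
            x = torch.nn.functional.interpolate(
                x.unsqueeze(0), size=(224, 224),
                mode="bilinear", align_corners=False).squeeze(0)
        else:
            # Simulate an already pre-resized dataset.
            x = torch.rand(3, 224, 224)
        return x

def time_loader(resize, num_workers=2):
    loader = DataLoader(SyntheticImages(resize=resize),
                        batch_size=16, num_workers=num_workers)
    start = time.perf_counter()
    for _ in loader:
        pass
    return time.perf_counter() - start
```

Comparing `time_loader(True)` against `time_loader(False)` shows the worst-case gap; if your GPU step time already dominates both numbers, the offline resize won’t buy you anything.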