Opinion about Batch Preprocessing vs Real Preprocessing for images

Hello,
I’d like your opinion on some approaches for applying preprocessing to images for deep learning (e.g., semantic segmentation).

Important note:

  • Take into account that this would be performed in a pipeline, so every new training run would re-apply it (assume, for instance, that the disk is emptied between training runs). Also take for granted that we have access to both CPUs and GPUs.

Approach 1: Batch transforms, real-time data augmentation

This approach would apply the deterministic transforms (resize, rescale by /255, ToTensor) in a batch and save the transformed images to disk before training.
Then during training, only the data augmentation would be left to apply on the fly (for the training set only).

TLDR: Transforms (images) => onDisk => Training + Data aug on the fly
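To make Approach 1 concrete, here is a minimal, framework-free sketch in plain Python (in practice this would be e.g. a `torch.utils.data.Dataset` with torchvision transforms; the file layout, 1×N toy "images", and flip augmentation are assumptions for illustration):

```python
import os
import pickle
import random
import tempfile

def preprocess(img):
    """Deterministic transforms: here just rescale uint8 [0, 255] -> float [0, 1]."""
    return [[px / 255.0 for px in row] for row in img]

def cache_to_disk(images, out_dir):
    """Apply the deterministic transforms once and save the results to disk."""
    paths = []
    for i, img in enumerate(images):
        p = os.path.join(out_dir, f"img_{i}.pkl")
        with open(p, "wb") as f:
            pickle.dump(preprocess(img), f)
        paths.append(p)
    return paths

def load_batch(paths, augment):
    """During training only the random augmentation runs; re-drawn every epoch."""
    batch = []
    for p in paths:
        with open(p, "rb") as f:
            img = pickle.load(f)
        if augment and random.random() < 0.5:   # e.g. a random horizontal flip
            img = [row[::-1] for row in img]
        batch.append(img)
    return batch

# Usage: cache once, then every epoch reads cached images + fresh augmentation.
raw = [[[0, 128, 255]], [[255, 0, 0]]]          # two tiny 1x3 "images"
with tempfile.TemporaryDirectory() as d:
    paths = cache_to_disk(raw, d)
    epoch_batch = load_batch(paths, augment=True)
print(len(epoch_batch))  # 2
```

The key property is that `preprocess` runs exactly once per image, while the augmentation inside `load_batch` is re-drawn on every epoch.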

Approach 2: Batch transforms, batch data augmentation

This approach would apply the transforms (resize, rescale by /255, ToTensor) and the data augmentation for the training set, then save everything to disk before training.
Then during training, we’d use a dataset with transforms=None, meaning no transforms are applied; it would simply read the already-processed images.

TLDR : Transforms(images) => Data Aug(images) => onDisk => Training

Approach 3: Batch data augmentation, real-time training + transforms

Same principle, but applied to the data augmentation instead (for the training set):

Data augmentation (images) => onDisk => Training + Transforms on the fly

Approach 4: Real-time transforms, real-time data augmentation

Here nothing is saved to disk besides the raw train/valid/test images.

Training + Transforms + Data augmentation on the fly.
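Approach 4 can be sketched as a dataset that does all the work per sample (again a framework-free sketch in plain Python; in PyTorch this would be a `torch.utils.data.Dataset` whose `__getitem__` applies the transform pipeline — the class name and flip augmentation are assumptions):

```python
import random

class OnTheFlyDataset:
    """Sketch of Approach 4: only raw images are stored; all transforms and
    augmentation run per sample on access, so augmentation differs each epoch."""

    def __init__(self, raw_images, train=True):
        self.raw = raw_images
        self.train = train

    def __len__(self):
        return len(self.raw)

    def __getitem__(self, i):
        # Deterministic transform: rescale uint8 [0, 255] -> float [0, 1].
        img = [[px / 255.0 for px in row] for row in self.raw[i]]
        # Random augmentation, training set only (e.g. horizontal flip).
        if self.train and random.random() < 0.5:
            img = [row[::-1] for row in img]
        return img

ds = OnTheFlyDataset([[[0, 255]], [[128, 128]]], train=True)
sample = ds[0]   # transforms + augmentation computed fresh on each access
```

Nothing is cached: the cost is paid at every epoch, but each epoch sees newly randomized samples.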

Important things to keep in mind:

  • If we use any batch approach, we could decide to perform the preprocessing on the CPU or the GPU; this might or might not be advantageous.
  • We might also want to consider whether keeping the processed data on disk is useful for reproducibility and lineage (e.g. to go back from a model in production and see exactly what it was trained on).
  • We must also keep in mind that if we perform the data augmentation on the fly, it is re-applied with new random parameters each epoch, which might or might not improve model accuracy.

So what do you guys think?

  1. This sounds like a valid approach: the data augmentation is applied to each sample during training, and caching the deterministic transforms could save some preprocessing time. In any case, I would profile image decoding (e.g. JPEG decoding) vs. loading the raw binary data (both speed and file size). E.g. loading a 1200x1600 JPEG-encoded file with a size of ~310 kB results in a tensor of ~23 MB, since raw pixels are stored if you are not resizing it.

  2. This approach wouldn’t use any random data augmentation during training, so I would claim it’s invalid. Even if you apply the augmentation once, this just creates a “new”, fixed dataset (with flipped, cropped, etc. images) which won’t be randomly transformed during training.

  3. Same as 1., except that the augmented images are also stored on disk, which I believe shouldn’t yield any advantage.

  4. Standard approach, so valid.
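As a back-of-envelope check of the file-size comparison in point 1: an unresized raw tensor costs height × width × channels × bytes-per-element, which for a 1200x1600 RGB image stored as float32 gives:

```python
# Raw float32 tensor size for a 1200x1600 RGB image (no resizing),
# vs. the ~310 kB JPEG file it was decoded from.
h, w, channels, bytes_per_float32 = 1200, 1600, 3, 4
raw_bytes = h * w * channels * bytes_per_float32
print(raw_bytes / 1e6)  # 23.04 -> ~23 MB, roughly 75x the JPEG file
```

This is why caching decoded, unresized tensors to disk can blow up storage and I/O compared to keeping the compressed files and decoding on the fly.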
