Building a custom CIFAR-10 (or any image) dataset that includes augmented images

I would like to build a custom image dataset. Let’s say my dataset contains 100 images and their labels. I would like to create a new dataset with 2 augmentations for every image, and include the labels too.

So, the dataset should contain 100 original images and labels + 100 images of some augmentation and labels + 100 images of some other augmentation and labels.

So, each label is repeated 3 times, i.e. once for the original and twice for the augmented images.

In PyTorch (and TensorFlow, for that matter), data augmentation is done on the fly as part of image preprocessing, preferably in parallel on the CPU while the model is training on the GPU. This means that you do not need to run the augmentations up front and store the results on disk. That is actually not recommended, although I guess it would be possible if you really want to. With random transforms, every epoch sees a different variant of each image, which a fixed set of stored copies cannot give you.
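If you do want the exact layout from the question (originals plus two fixed augmentations, each label repeated 3 times), one way to get it without writing anything to disk is to map indices onto (transform, sample) pairs. The following is only a rough sketch under some assumptions: the class name `AugmentedViews`, the helper functions, and the toy nested-list "images" are all made up for illustration, and the `ImportError` fallback is there just so the snippet runs even without PyTorch installed.

```python
try:
    from torch.utils.data import Dataset  # real base class when PyTorch is installed
except ImportError:  # fallback so this sketch still runs without PyTorch
    Dataset = object


class AugmentedViews(Dataset):
    """Hypothetical map-style dataset that exposes each sample once per transform.

    Index i maps to sample i % n and transform i // n, so the augmented
    copies are computed when requested instead of being stored.
    """

    def __init__(self, images, labels, transforms):
        assert len(images) == len(labels)
        self.images = images
        self.labels = labels
        self.transforms = transforms

    def __len__(self):
        return len(self.images) * len(self.transforms)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        n = len(self.images)
        transform = self.transforms[index // n]  # which view of the data
        sample = index % n                       # which underlying image
        return transform(self.images[sample]), self.labels[sample]


# Toy stand-ins for real augmentations (assumed for illustration only).
def identity(image):
    return image


def hflip(image):
    return [row[::-1] for row in image]  # mirror left to right


def vflip(image):
    return image[::-1]  # mirror top to bottom


# 100 fake 2x2 "images" as nested lists, with 10 classes.
images = [[[i, i + 1], [i + 2, i + 3]] for i in range(100)]
labels = [i % 10 for i in range(100)]

dataset = AugmentedViews(images, labels, [identity, hflip, vflip])
print(len(dataset))  # 100 originals + 2 * 100 augmented views
```

Indices 0 to 99 return the originals, 100 to 199 the first augmentation, and 200 to 299 the second, always with the matching label. With real data the transforms would typically be deterministic image operations rather than these list helpers.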

This PyTorch tutorial is often referenced on this forum and is worth checking out. It’s really great both for a deeper understanding of custom data loading and for insight into best practices for data augmentation. If you would like to see a minimal example of what on-the-fly data augmentation looks like in PyTorch, you can see a code example I’ve created here
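Independently of the linked example, here is a small sketch of what "on the fly" means in practice: the transform runs inside `__getitem__`, so the same index can come back as a different variant each time it is loaded. The names `OnTheFlyDataset` and `random_hflip` and the toy nested-list image are assumptions made for illustration. In a real pipeline the transform would more likely be a `torchvision.transforms` pipeline (for example one containing `RandomHorizontalFlip`), and the dataset would be wrapped in a `DataLoader`.

```python
import random

try:
    from torch.utils.data import Dataset  # real base class when PyTorch is installed
except ImportError:  # fallback so this sketch still runs without PyTorch
    Dataset = object


class OnTheFlyDataset(Dataset):
    """Hypothetical dataset that applies its transform every time a sample is read."""

    def __init__(self, images, labels, transform=None):
        self.images = images
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.images)  # dataset size does not grow with augmentation

    def __getitem__(self, index):
        image = self.images[index]
        if self.transform is not None:
            image = self.transform(image)  # applied at load time, on every access
        return image, self.labels[index]


def random_hflip(image, p=0.5):
    """Mirror the image left to right with probability p."""
    if random.random() < p:
        return [row[::-1] for row in image]
    return image


images = [[[1, 2], [3, 4]]]  # a single toy 2x2 "image"
labels = [7]
dataset = OnTheFlyDataset(images, labels, transform=random_hflip)

random.seed(0)  # fixed seed only to make this demo repeatable
views = [dataset[0][0] for _ in range(20)]  # same index, read 20 times
print(len(dataset), len(views))
```

The dataset still has a single entry, yet repeated reads of index 0 return both the original and the flipped image, which is the effect you get across epochs during training.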
