Concatenation while using Data Loader

Hello everyone,

I have a small question about data augmentation and the DataLoader. To train on augmented images in addition to the unaltered originals, I am passing a ConcatDataset to the DataLoader, built from two copies of the same dataset with different transforms (data_transforms_A contains several augmentation techniques, while data_transforms contains only scaling, leaving the original images intact). Is this the right approach? Also, within each batch, let's say of size 64, what is the distribution of images fetched from each dataset?
Please find below the code I am currently using:

train_loader = torch.utils.data.DataLoader(
    torch.utils.data.ConcatDataset([
        datasets.ImageFolder(args.data + '/train_images', transform=data_transforms_A),
        datasets.ImageFolder(args.data + '/train_images', transform=data_transforms),
    ]),
    batch_size=args.batch_size, shuffle=True, num_workers=1)
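
For context, the two pipelines look roughly like this (the specific augmentations and the 64x64 target size are just placeholders, not my exact settings):

from torchvision import transforms

# Augmented pipeline: several random transforms, then scaling.
data_transforms_A = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
])

# Plain pipeline: only scaling, so the original images stay intact.
data_transforms = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
])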

Thank you.


Yes, this seems correct to me. The combined training set will contain half unaugmented images and half augmented images.
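
You can sanity-check the size: ConcatDataset's length is simply the sum of the lengths of its parts. A minimal sketch with toy tensor datasets standing in for the two ImageFolders:

import torch
from torch.utils.data import ConcatDataset, TensorDataset

part_a = TensorDataset(torch.randn(500, 3))  # stands in for the augmented ImageFolder
part_b = TensorDataset(torch.randn(500, 3))  # stands in for the plain ImageFolder

combined = ConcatDataset([part_a, part_b])
print(len(combined))  # 1000: the sum of the two lengths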


Hi Simon, thanks for the reply. Just to confirm, can we say the same about how images are sampled within a particular batch (around half from each dataset)?
Thanks.

Yes. With shuffle=True, the sampler draws a random permutation over the indices of the combined dataset, and each batch sequentially takes the next segment of that permutation, so in expectation each batch is about half from each dataset.
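
If you want to verify the per-batch split empirically, here is a small sketch (toy tensor datasets stand in for the two ImageFolders, with a source tag attached to each sample):

import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Each sample carries a source tag (0 = augmented set, 1 = original set)
# so we can count the composition of every batch.
augmented = TensorDataset(torch.randn(1000, 3), torch.zeros(1000, dtype=torch.long))
original = TensorDataset(torch.randn(1000, 3), torch.ones(1000, dtype=torch.long))

loader = DataLoader(ConcatDataset([augmented, original]), batch_size=64, shuffle=True)

for i, (_, source) in enumerate(loader):
    n_original = source.sum().item()
    print(f"batch {i}: {source.numel() - n_original} augmented, {n_original} original")
    if i == 2:  # the first few batches are enough to see the roughly even split
        break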

So does that mean, in the usual case where we define only one DataLoader for the training images (with some augmentations such as random crop or horizontal flip), only the transformed images are fed into the optimization loop and the original images are never used?

Yes, this is typical in deep learning. However, most common transforms have some probability of being the identity transform, so the original data still has a chance of being used.
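
For example, torchvision's RandomHorizontalFlip takes a probability p (0.5 by default), so the image passes through untouched half the time, and RandomApply wraps any transform with a probability in the same way:

from torchvision import transforms

flip = transforms.RandomHorizontalFlip(p=0.5)  # identity with probability 0.5

maybe_jitter = transforms.RandomApply(
    [transforms.ColorJitter(brightness=0.4)],  # applied with probability 0.3,
    p=0.3,                                     # otherwise the image is unchanged
)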