How do I resample a dataset with data augmentation to make the dataloader larger?

Hello everyone,
I am working with a PyTorch dataset that I want to make bigger by duplicating the entire dataset multiple times to get a larger dataloader (for one-shot learning purposes). For example, I have 10 classes containing 1 image each, for a total of 10 images (a dataloader of length 10 at batch size 1). I want to resample the entire dataset multiple times (duplicate each image 20 times for a total of 200 images) and make each duplicate different through data augmentation.


So (dataset of 1 cat, 1 dog) → (resample/duplicate dataset 20 times) → (data augmentation) → (dataset of 20 cats, 20 dogs with variations of their original image) = (dataloader of size 40 for 1 batch)


What would be an effective way to accomplish this?
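One common pattern is to wrap the dataset so that every index of the enlarged dataset maps back to an original sample; `torch.utils.data.ConcatDataset([ds] * 20)` achieves effectively the same thing. Below is a framework-free sketch of such a wrapper (the `RepeatDataset` name and the toy base "dataset" are mine, not from torch; a real version would subclass `torch.utils.data.Dataset` and apply the random transform inside `__getitem__`):

```python
# Sketch of a "repeat" wrapper: index i of the enlarged dataset
# maps to index i % len(base) of the original dataset, so every
# duplicate points at the same underlying image.
class RepeatDataset:
    def __init__(self, base, repeats):
        self.base = base          # the original (small) dataset
        self.repeats = repeats    # how many times to "duplicate" it

    def __len__(self):
        return len(self.base) * self.repeats

    def __getitem__(self, i):
        # In a real Dataset the random augmentation would be applied
        # here, so each duplicate is re-randomized on every load.
        return self.base[i % len(self.base)]

base = ["cat.jpg", "dog.jpg"]           # 1 image per class
bigger = RepeatDataset(base, repeats=20)
print(len(bigger))                      # 40 -> dataloader of length 40 at batch_size=1
```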

vis_dataloader = DataLoader(sample_dataset,
                        shuffle=True,
                        num_workers=8,
                        #num_workers=0,
                        batch_size=1)
dataiter = iter(vis_dataloader)

I’m not sure you would really need to duplicate the images, since the data augmentation methods are applied to each sample separately while loading the image.
I.e., keeping the 10 initial images (1 image per class) or duplicating them would only change the number of iterations in an epoch (assuming the batch size is constant), but each image will be transformed randomly in both cases.
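To illustrate this with a framework-free sketch: loading the same index repeatedly yields a freshly randomized transform each time, which is exactly how torchvision's random transforms (e.g. `ColorJitter`, `RandomCrop`) behave inside `__getitem__`. The class and the "shift" augmentation below are stand-ins of my own, not torchvision code:

```python
import random

# Minimal stand-in for a Dataset whose transform is random per call.
class AugmentedDataset:
    def __init__(self, images):
        self.images = images

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        # Stand-in "augmentation": pair the image with a randomly drawn
        # shift. torchvision's random transforms draw new parameters on
        # every call just like this, so duplicates are never identical.
        shift = random.choice(["left", "right", "up", "down"])
        return (self.images[i], shift)

ds = AugmentedDataset(["cat.jpg"])
epoch_1 = [ds[0] for _ in range(5)]  # the same index, loaded five times
# The underlying image is identical each time, but the augmentation
# parameters are re-drawn on every load.
```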

So keeping the initial 10 images at batch size 1 would make the dataloader’s length 10. I essentially want my model to have more instances of the initial images to train on.

For example, maybe within the 10 initial images (dataloader of size 10), one class got a single transformation (a shift to the right), and that’s it for that class. But I want to add more instances of the same image, transformed differently. I want there to be instances of the class where it has been shifted to the left, or maybe upward, etc.

My ultimate goal is to make the training data larger than just 1 image per class so that there will be more variety for my network to work with per class. Is there any way I can increase the number of images per class, with the data augmentation being randomized/different across those duplicates of the same class?

I don’t see what the difference would be between duplicating images vs. training for more epochs.
In both cases random transformations are applied to the original data (in the first case you would just create copies of the original data and apply the transformations to them).

I’ll try to “visualize” it:

# duplicating data
initial_dataset = [0, 1, 2, 3]
dataset = [0, 1, 2, 3, 0, 1, 2, 3] # one epoch would contain 2*initial_dataset = 8 samples
loader = [trans(0), trans(1), trans(2), trans(3), trans(0), trans(1), trans(2), trans(3)] # each sample will be randomly transformed in this epoch

# keeping the original data
initial_dataset = [0, 1, 2, 3]
dataset = initial_dataset # one epoch would contain initial_dataset = 4 samples
loader = [trans(0), trans(1), trans(2), trans(3)] # each sample will be randomly transformed in this epoch
# train for 2 epochs
data_seen_in_2_epochs = [trans(0), trans(1), trans(2), trans(3), trans(0), trans(1), trans(2), trans(3)]
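The equivalence can be made concrete with a small counting sketch (`trans` here is a placeholder of my own for any random augmentation, not a torchvision function):

```python
import random

def trans(x):
    # Placeholder for a random augmentation: new parameters every call.
    return (x, random.random())

initial_dataset = [0, 1, 2, 3]

# Option A: duplicate the data once, train for 1 epoch.
duplicated = initial_dataset * 2
seen_a = [trans(x) for x in duplicated]

# Option B: keep the original data, train for 2 epochs.
seen_b = [trans(x) for _ in range(2) for x in initial_dataset]

# Both options draw 8 randomly transformed samples from the same 4 originals.
print(len(seen_a), len(seen_b))   # 8 8
```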

I was under the impression that once you initialize your dataloader on your dataset, the dataset and images inside the dataloader are effectively locked until you reinitialize the dataloader.

I see what you are saying, but I am not sure how performing random and different transforms per epoch would work in my code.

What would be a proper way to reformat this code so that I have different transformations per epoch?


sample_dataset = sampleDataset(imageFolderDataset=folder_dataset,
                                        transform=transforms.Compose([
                                                    transforms.Grayscale(num_output_channels=3),
                                                    transforms.Resize((244,244)),
                                                    transforms.ColorJitter(brightness=(0.2,1.0),contrast=(0.1,1.1),hue=.05, saturation=(.0,.15)),
                                                    transforms.ToTensor()
                                                ]),
                                        should_invert=False)

vis_dataloader = DataLoader(sample_dataset,
                        shuffle=True,
                        num_workers=8,
                        #num_workers=0,
                        batch_size=1)
epochs = 100

for i in range(epochs):
  print('epoch ', i)
  # Iterate the DataLoader directly: a fresh iterator (and freshly
  # randomized transforms) is created every epoch. A single
  # `iter(vis_dataloader)` object would be exhausted after epoch 0.
  for inputImg, posAnc, negAnc, pathList in vis_dataloader:
    concatenated = torch.cat((inputImg, posAnc, negAnc), 0)
    imshow(torchvision.utils.make_grid(concatenated))

Also, if I want to use the same transformation from one epoch, wait a few epochs, and then apply that same transform again (for example, the same augmented image from epoch 1 reappears in epoch 10), how would I do that?

Transformations are usually applied in the __getitem__ method of the Dataset (I don’t know if you are using a custom Dataset or a pre-defined one), and random transformations are random by default. In your case ColorJitter will use random values for each call and thus for each sample (the same applies to e.g. RandomCrop).

One way would be to seed the code (I wouldn’t recommend this approach if you are not familiar with how seeding and the pseudo-random number generator work), or to use the functional API via torchvision.transforms.functional and apply the transformations by creating the parameters for each transformation manually.
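A framework-free sketch of the second idea (the helper name and parameter names are mine): draw the transform parameters yourself from a generator seeded on (epoch, sample index), then feed them to the functional API (e.g. `torchvision.transforms.functional.adjust_brightness`). Re-using the same key later, say during epoch 10, replays epoch 1's parameters exactly:

```python
import random

def transform_params(epoch, index):
    # Deterministic per-(epoch, index) generator: the same key always
    # yields the same augmentation parameters, independent of global state.
    rng = random.Random(epoch * 1_000_000 + index)
    return {
        "brightness": rng.uniform(0.2, 1.0),  # would feed F.adjust_brightness
        "hue": rng.uniform(-0.05, 0.05),      # would feed F.adjust_hue
    }

# Epoch 1 draws parameters for sample 0 ...
p1 = transform_params(epoch=1, index=0)
# ... and any later epoch can replay exactly the same transform on demand
# by asking for the epoch-1 key again.
p10 = transform_params(epoch=1, index=0)
print(p1 == p10)   # True
```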