Data augmentation before training

Hi everyone,
I have a dataset with 885 images, and I have to perform data augmentation, generating 3000 training examples for each image via random translation and random rotation. However, the dataset would grow too large and I cannot store all the images on disk. I know that I can apply transforms 'on the fly', but I need to create the augmented dataset first and then train on the complete dataset.

Is there a way to augment the data and save the augmented 'DataLoader' in separate files, then use another process to load the saved 'DataLoaders' and train the network with all the examples?

After applying the transformations to your data (without training the model), you could save each sample via torch.save. Later, in your training script, you could create a custom Dataset, as explained here, and load each augmented sample instead of the original image.
Let me know if you get stuck.
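A minimal sketch of that workflow, assuming the augmented samples are plain tensors saved one file per sample (the directory layout, file names, and tensor shapes below are illustrative, not from the thread):

```python
import os
import tempfile

import torch
from torch.utils.data import Dataset


class SavedSampleDataset(Dataset):
    """Lazily loads pre-augmented samples saved to disk via torch.save."""

    def __init__(self, sample_dir):
        self.paths = sorted(
            os.path.join(sample_dir, f) for f in os.listdir(sample_dir)
        )

    def __getitem__(self, index):
        # Load the pre-augmented tensor instead of the raw image
        return torch.load(self.paths[index])

    def __len__(self):
        return len(self.paths)


# Example: save a few fake "augmented" samples, then read them back
sample_dir = tempfile.mkdtemp()
for i in range(4):
    torch.save(torch.randn(3, 8, 8), os.path.join(sample_dir, f"sample_{i}.pt"))

dataset = SavedSampleDataset(sample_dir)
print(len(dataset), dataset[0].shape)
```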


I have tried torch.save, but the problem is that creating 3000 transformed samples for each image and storing them would occupy too much disk space.

I misunderstood the question then, sorry.

Could you explain how this workflow would look:

I know that I can perform transform ‘on the fly’ but I need to create the augment the dataset and then train the complete dataset.
Is there a way to augment data and save augmented ‘dataLoader’ in separate files, use another process to load the saved ‘dataloaders’ and train the network with all the examples?

Usually you would lazily load the data in the Dataset.__getitem__ method and apply the transformation there.
I don't really understand the point of storing the DataLoader (or Dataset).
If you want to increase the number of returned samples, you could multiply the number of samples by 3000 in Dataset.__len__.


Thank you so much for your help. The dataset contains 884 images. For each image in the dataset I need to generate 3000 new images, rotated and translated randomly, with the related bounding boxes, without storing them for training. But I do not know how to generate exactly 3000 transformed images for each image, nor how these transformations are reflected in the number of samples.

Thanks for the information.
If I understand the use case correctly, you would have 884*3000 images in each epoch, where each of the original 884 images will be randomly transformed 3000 times.
In that case, my previous proposal should work, and this code snippet shows what I meant:

import torch
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    def __init__(self, data, length):
        self.data = data
        self.data_len = len(self.data)
        self.len = length

    def __getitem__(self, index):
        # Map the "virtual" index back to a real sample index
        data_idx = index % self.data_len
        print('index {}, data_idx {}'.format(index, data_idx))
        x = self.data[data_idx]
        return x

    def __len__(self):
        return self.len


data = torch.randn(10, 1)
length = 30
dataset = MyDataset(data, length)
loader = DataLoader(dataset, batch_size=2)

for x in loader:
    print(x.shape)

Basically, you would artificially increase the number of samples by passing the desired length directly to the dataset, and inside the __getitem__ method you would use the modulo operation to repeatedly sample from the data.
Let me know if this works for you.

Thank you so much, it works for me. I have a question, if you could help me more. Following the tutorial https://pytorch.org/tutorials/beginner/data_loading_tutorial.html I apply the transformation in the Dataset.__getitem__ method. I have a doubt: does the dataset contain the same transformed samples in every epoch, or are the samples transformed randomly in each epoch, so that the dataset is different every epoch?

The data will be transformed randomly on each call, and thus also on each epoch, as long as you don't manually seed the code and force the random number generator to apply the same "random" transformations.
