What would be the best way to handle the following use case?
I use a DataLoader to read a dataset from a .txt file and then train.
train_text = "/home/vathsan/train.txt"
train_set = CustomDataset(train_text)
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True,
                          drop_last=True, pin_memory=True)

for epoch in range(num_epochs):
    model.train()
    for input_tr, target_tr in train_loader:
        input_tr, target_tr = input_tr.to(device), target_tr.to(device)
        optimizer.zero_grad()
        output_tr = model(input_tr)  # call the model directly instead of model.forward
The dataset consists of around 5000 images. For training, I currently create 5 variations of each image and end up with a total of 25000 images (all saved to disk).
I want to read n images from the original 5k images (for each epoch), create m variations of each image on the fly, and then train using a specified batch size. For example, read 4 images, create 5 variations of each image on the fly (4 x 5 = 20 images), and train the model with a batch size of 4 on the created set. For each epoch I would read n (4 in this case) new images from the train.txt file and continue the process.
I am happy to try any alternative solution. The only limitation is storage: I don't want to store all the data on disk and then read it back.
The default approach is to lazily load each sample in the Dataset and to apply random transformations to it. Would this not work for you, or why do you want to explicitly create 5 variations of each sample?
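Something like this minimal sketch is what I mean by the default lazy approach, assuming your train.txt contains one image path per line and using a random flip as a stand-in for your actual transformation (returning just the transformed sample here for brevity):

import random
from PIL import Image
import torchvision.transforms.functional as TF
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, txt_path):
        # Only the file paths are kept in memory; images are loaded lazily.
        with open(txt_path) as f:
            self.paths = [line.strip() for line in f if line.strip()]

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = TF.to_tensor(Image.open(self.paths[idx]).convert("RGB"))
        # Random transformation applied on the fly each time the sample is drawn.
        if random.random() < 0.5:
            img = TF.hflip(img)
        return img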
I am working on an image upsampling problem where I train a model on undersampled data and target data. I want to create the undersampled data from the original (target) data on the fly, as mentioned above. For the undersampling I drop frames (columns) of an image using different factors (2, 3, 4, up to 10) and apply some post-processing, so that the model can handle different scales of undersampled images.
I was wondering if there is a way to create the undersampled data during training rather than creating it all beforehand and reading it from disk.
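For context, the undersampling is roughly along the lines of this simplified sketch, assuming a CxHxW tensor and with the interpolation standing in for the actual post-processing:

import random
import torch.nn.functional as F

def undersample(target, factors=range(2, 11)):
    # Pick a random undersampling factor between 2 and 10 and keep only
    # every factor-th column of the target image.
    factor = random.choice(list(factors))
    dropped = target[..., ::factor]
    # Placeholder post-processing: interpolate back to the original size.
    restored = F.interpolate(dropped.unsqueeze(0), size=target.shape[-2:],
                             mode="bilinear", align_corners=False)
    return restored.squeeze(0)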
Similar to applying any other transformation, you could create these 5 variations inside Dataset.__getitem__ and return 5 samples instead of 1. This would of course also increase your effective batch size by 5x, but the variations would only be created on the fly for each sample.
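For example (just a sketch, reusing the undersample snippet from your previous post and reading the paths from the .txt file as in your training code):

import torch
from PIL import Image
import torchvision.transforms.functional as TF
from torch.utils.data import Dataset

class VariationDataset(Dataset):
    def __init__(self, txt_path, num_variations=5):
        with open(txt_path) as f:
            self.paths = [line.strip() for line in f if line.strip()]
        self.num_variations = num_variations

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Load the target once and create the variations on the fly.
        target = TF.to_tensor(Image.open(self.paths[idx]).convert("RGB"))
        inputs = torch.stack([undersample(target) for _ in range(self.num_variations)])
        targets = torch.stack([target] * self.num_variations)
        return inputs, targets

With a batch_size of B the DataLoader would then return tensors of shape [B, 5, C, H, W], which you could flatten via input_tr = input_tr.flatten(0, 1) (and the same for the target) before the forward pass.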
I'm speculating here, but I can imagine that transforming the same sample multiple times and using it in a single batch might not be beneficial for the training, as it reduces the randomness/shuffling. However, I might have misunderstood your use case, as I thought you wanted to explicitly perform the training in this way.
If not, I'm unsure why the standard approach of transforming a single sample inside __getitem__ won't work. If you have a specific requirement to use at least 5 different transformations of each sample, just train for at least 5 epochs, as each sample will be randomly transformed once per epoch (assuming your sampler picks each sample once, which is the default behavior).
The requirement is to take n random samples from the original dataset, create m transformations of each sample (ending up with n x m images in total), and then train the model on batches of x images drawn from these. For every epoch I would select n new samples (with randomness/shuffling) from the original dataset.
Would it be possible to use two DataLoaders to achieve this: loading n samples with the first DataLoader, transforming them inside __getitem__, and then loading batches of x images with a second DataLoader for training?
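Something along these lines is what I was thinking of (just a sketch: original_set would be a Dataset returning only the original target images as tensors, undersample is the function sketched above, and criterion stands in for my loss function):

import torch
from torch.utils.data import DataLoader, TensorDataset

n, m, x = 4, 5, 4  # images per epoch, variations per image, batch size
outer_loader = DataLoader(original_set, batch_size=n, shuffle=True, drop_last=True)

for epoch in range(num_epochs):
    model.train()
    # n new target images for this epoch (shuffle=True picks a different random subset).
    targets = next(iter(outer_loader))
    # Create the n x m variations on the fly and pair each with its target.
    inputs = torch.cat([torch.stack([undersample(t) for _ in range(m)]) for t in targets])
    targets_rep = targets.repeat_interleave(m, dim=0)
    inner_loader = DataLoader(TensorDataset(inputs, targets_rep),
                              batch_size=x, shuffle=True)

    for input_tr, target_tr in inner_loader:
        input_tr, target_tr = input_tr.to(device), target_tr.to(device)
        optimizer.zero_grad()
        output_tr = model(input_tr)
        loss = criterion(output_tr, target_tr)
        loss.backward()
        optimizer.step()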