Hello everyone.
I want to create a dataset based on CIFAR10, then create a DataLoader from it and train my model.
I have a function that adds some noise to the CIFAR10 images, say:
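Roughly like this (a simplified sketch; the exact noise logic isn't the point, only that it is currently applied per image in __getitem__):

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, transforms

def create_noise(images):
    # stand-in for my real function: adds Gaussian noise,
    # works on a single image or on a whole batch
    return images + 0.1 * torch.randn_like(images)

class NoisyCIFAR10(Dataset):
    def __init__(self, root="./data", train=True):
        self.dataset = datasets.CIFAR10(
            root=root, train=train, download=True,
            transform=transforms.ToTensor())

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        image, label = self.dataset[index]
        # create_noise is called once per image here
        noisy = create_noise(image)
        return noisy, label

loader = DataLoader(NoisyCIFAR10(), batch_size=64, shuffle=True)
```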
This method works, but the problem is time: although the create_noise function can receive a batch of data, here it has to process the images one by one, which takes a lot of time.
Can we pass a batch of data to __getitem__? If so, how? And in that case, how should the DataLoader for training be set up?
Finally, I don't have much experience with PyTorch, so I may not have chosen the right approach.
I appreciate any help.
A custom Dataset should certainly work. Depending on what create_noise does, you could either add the noise to the data directly (once, up front), as seen in this post, or sample it in each iteration.
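For example, a rough sketch of the first option, adding the noise once to the entire dataset so it stays fixed (assuming your create_noise accepts a batched float tensor; the stub below is just a placeholder):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader
from torchvision import datasets

def create_noise(images):
    # placeholder for your batched noise function
    return images + 0.1 * torch.randn_like(images)

cifar = datasets.CIFAR10(root="./data", train=True, download=True)

# the raw data is a uint8 numpy array of shape [50000, 32, 32, 3]
data = torch.from_numpy(cifar.data).permute(0, 3, 1, 2).float() / 255.
targets = torch.tensor(cifar.targets)

# add the noise once to all images (static noise)
noisy_data = create_noise(data)

dataset = TensorDataset(noisy_data, targets)
loader = DataLoader(dataset, batch_size=64, shuffle=True)
```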
Alternatively, you could write a custom transformation, as seen in this post, which might be a better approach.
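A rough sketch of such a transform (assuming the noise should be applied to the tensor image, so it is placed after ToTensor):

```python
import torch
from torchvision import datasets, transforms

class AddNoise:
    """Custom transform that adds noise to a tensor image."""
    def __init__(self, std=0.1):
        self.std = std

    def __call__(self, img):
        # img is a [3, 32, 32] tensor after ToTensor(); fresh noise each call
        return img + self.std * torch.randn_like(img)

transform = transforms.Compose([
    transforms.ToTensor(),
    AddNoise(std=0.1),
])

dataset = datasets.CIFAR10(root="./data", train=True, download=True,
                           transform=transform)
```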
However, based on your description I understand that create_noise might be expensive, so you want to avoid calling it per sample and would prefer to call it once for the entire batch.
In this case you could use a BatchSampler and pass the indices for the entire batch to __getitem__, as seen in this post. Note that the batch_size specified in the DataLoader then no longer represents the actual batch size: the number of samples in each batch becomes loader.batch_size * sampler.batch_size. In my example I've defined the batch size only in the sampler and kept the DataLoader's default.
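A rough sketch of that approach for your use case (create_noise is again a placeholder standing in for your batched function):

```python
import torch
from torch.utils.data import Dataset, DataLoader, BatchSampler, RandomSampler
from torchvision import datasets

def create_noise(images):
    # placeholder for your batched noise function
    return images + 0.1 * torch.randn_like(images)

class NoisyCIFAR10Batches(Dataset):
    def __init__(self, root="./data", train=True):
        cifar = datasets.CIFAR10(root=root, train=train, download=True)
        # keep the data as tensors so a whole batch can be indexed at once
        self.data = torch.from_numpy(cifar.data).permute(0, 3, 1, 2).float() / 255.
        self.targets = torch.tensor(cifar.targets)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, indices):
        # indices is a list of ints coming from the BatchSampler
        images = self.data[indices]      # [sampler.batch_size, 3, 32, 32]
        labels = self.targets[indices]
        noisy = create_noise(images)     # one call per batch instead of per sample
        return noisy, labels

dataset = NoisyCIFAR10Batches()
sampler = BatchSampler(RandomSampler(dataset), batch_size=64, drop_last=False)

# batch_size is left at its default of 1 in the DataLoader, so the effective
# batch size is loader.batch_size * sampler.batch_size = 1 * 64
loader = DataLoader(dataset, sampler=sampler)

for images, labels in loader:
    # the DataLoader adds an extra leading dim of size 1; remove it
    images, labels = images.squeeze(0), labels.squeeze(0)
    # ... training step ...
```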