I need to split the CIFAR10 dataset into training and validation sets. The problem is that I wish to apply augmentations to the training data, and these are applied while loading the data. But if I split the loaded data into a validation set, it also contains the augmentations, which I obviously don’t want.
The snippet above applies augmentations to the validation data as well (since it is just a part of the originally loaded data). Can I split the data before applying the transforms?
You could create two different datasets (one for training and the other for validation) with different transformations. Afterwards you could sample the split indices and either wrap both datasets in a Subset or use a SubsetRandomSampler.
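A minimal sketch of this approach, using a toy in-memory dataset to stand in for CIFAR10 (the `WrappedDataset` class and the lambda "augmentation" are hypothetical placeholders; with CIFAR10 you would instead create two `torchvision.datasets.CIFAR10` instances pointing at the same root, passing the train transform to one and the validation transform to the other):

```python
import torch
from torch.utils.data import Dataset, Subset

class WrappedDataset(Dataset):
    def __init__(self, data, transform=None):
        self.data = data
        self.transform = transform

    def __getitem__(self, idx):
        x = self.data[idx]
        if self.transform is not None:
            x = self.transform(x)
        return x

    def __len__(self):
        return len(self.data)

data = torch.arange(10, dtype=torch.float32)

# Same underlying data, different transforms.
train_dataset = WrappedDataset(data, transform=lambda x: x * 2)  # stands in for augmentations
val_dataset = WrappedDataset(data, transform=None)               # no augmentation

# Sample the split indices once and reuse them for both datasets,
# so the two splits don't overlap.
indices = torch.randperm(len(data)).tolist()
train_idx, val_idx = indices[:8], indices[8:]

train_split = Subset(train_dataset, train_idx)
val_split = Subset(val_dataset, val_idx)
```

Each `Subset` indexes into its own dataset, so the training split sees the augmented samples while the validation split sees the raw ones.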
Usually you would implement a dataset with lazy loading, so that the actual data samples are loaded in each call to __getitem__. Creating multiple datasets would thus be cheap, as no data would be loaded in the __init__.
Since CIFAR10 is considered small (50000 samples in the train set * 3 channels * 32 height * 32 width * 1 byte per uint8 value ~= 150MB) it’s loaded directly in the __init__.
If you are concerned about wasting these 150MB, you could derive a custom dataset using CIFAR10 as the parent class and perform the split internally.
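A sketch of splitting inside a derived dataset, assuming the parent loads everything into `self.data` in its __init__ (as CIFAR10 does). `ToyParent` is a hypothetical stand-in for `torchvision.datasets.CIFAR10`; with the real class you would call `super().__init__(root, train=True, ...)` and then slice `self.data` and `self.targets`:

```python
import torch
from torch.utils.data import Dataset

class ToyParent(Dataset):
    def __init__(self):
        # Stands in for CIFAR10 loading all samples eagerly in __init__.
        self.data = torch.arange(100)

class SplitDataset(ToyParent):
    def __init__(self, indices, transform=None):
        super().__init__()
        # Keep only this split's samples, discarding the rest.
        self.data = self.data[indices]
        self.transform = transform

    def __getitem__(self, idx):
        x = self.data[idx]
        if self.transform is not None:
            x = self.transform(x)
        return x

    def __len__(self):
        return len(self.data)

# Shared indices again make the splits disjoint.
indices = torch.randperm(100)
train_ds = SplitDataset(indices[:90], transform=lambda x: x * 2)
val_ds = SplitDataset(indices[90:])
```

Each instance still loads the full data once in the parent __init__, but only holds on to its own split afterwards.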
random_split expects the passed lengths to sum to the length of the dataset, i.e. every sample is assigned to one of the splits.
If you want to use a subset only, use torch.utils.data.Subset or a SubsetRandomSampler and pass the desired indices to them.
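The difference can be sketched as follows (using a small `TensorDataset` as a stand-in for the real data):

```python
import torch
from torch.utils.data import (TensorDataset, Subset, SubsetRandomSampler,
                              DataLoader, random_split)

dataset = TensorDataset(torch.arange(10))

# random_split: the lengths must cover all 10 samples.
train_set, val_set = random_split(dataset, [8, 2])

# Subset: use only the explicitly listed samples.
subset = Subset(dataset, [0, 2, 4])

# SubsetRandomSampler: draw (in random order) only from the listed indices.
loader = DataLoader(dataset, sampler=SubsetRandomSampler([0, 2, 4]), batch_size=2)
```

With the sampler approach the DataLoader will only ever yield samples 0, 2, and 4, shuffled each epoch.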