Split dataset into training and validation without applying training transform

codeknight11 · March 20, 2021, 12:29pm

I need to split the CIFAR10 dataset into training and validation sets. The problem is that I want to apply augmentations to the training data only. The augmentations are applied while loading the data, so when I split off a validation set, it also gets the augmentations, which I obviously don’t want.

from torch.utils.data import random_split
from torchvision import datasets, transforms

train_transform = transforms.Compose([transforms.RandomRotation(10),
                                       transforms.RandomHorizontalFlip(),
                                       transforms.ToTensor(),
                                       transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

train_data = datasets.CIFAR10('data', train=True, download=True, transform=train_transform)

# Generate validation set - https://stackoverflow.com/a/51768651
train_size = int(len(train_data) * 0.8) # 80% training data
valid_size = len(train_data) - train_size # 20% validation data
train_data, valid_data = random_split(train_data, [train_size, valid_size])

The snippet above applies augmentations to the validation data as well (since it is just a part of the originally loaded data). Can I split the data before applying transforms?

ptrblck · March 21, 2021, 7:24am

You could create two different datasets (one for training and the other for validation) with different transformations. Afterwards you could sample the split indices once and either wrap both datasets in a Subset or use a SubsetRandomSampler.
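
A minimal sketch of this idea, under some assumptions: `ToyImages` is a hypothetical stand-in for CIFAR10 (so nothing needs to be downloaded), `split_indices` is an illustrative helper rather than a library function, and the lambda stands in for a real augmentation pipeline. The point it shows is that both datasets cover the same samples and the same index permutation is cut into two disjoint parts.

```python
import torch
from torch.utils.data import Dataset, Subset


class ToyImages(Dataset):
    """Hypothetical stand-in for CIFAR10: 100 samples, optional transform."""

    def __init__(self, transform=None):
        self.data = torch.arange(100, dtype=torch.float32)
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x = self.data[idx]
        return self.transform(x) if self.transform is not None else x


def split_indices(n, train_fraction=0.8, seed=0):
    """Shuffle range(n) once and cut it into disjoint train/valid index lists."""
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(n, generator=g).tolist()
    cut = int(n * train_fraction)
    return perm[:cut], perm[cut:]


# Two dataset objects over the same samples, differing only in their transform.
# With CIFAR10 these would be two datasets.CIFAR10('data', train=True, ...)
# objects, one with train_transform and one with a validation transform.
train_full = ToyImages(transform=lambda x: x + 0.5)  # pretend augmentation
valid_full = ToyImages(transform=None)               # no augmentation

train_idx, valid_idx = split_indices(len(train_full))
train_data = Subset(train_full, train_idx)  # augmented samples only
valid_data = Subset(valid_full, valid_idx)  # untouched samples only
```

Because both subsets are built from one permutation, no sample can end up in both splits, and fixing the seed keeps the split reproducible across runs.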

codeknight11 · March 21, 2021, 7:37am

Yes, but how do I split the dataset before loading the data?
I have to supply the transform while loading the CIFAR dataset.

ptrblck · March 21, 2021, 7:53am

Usually you would implement a dataset with lazy loading, so that the actual data samples are loaded in each call to __getitem__. Creating multiple datasets would then be cheap (as no data would be loaded in __init__).
Since CIFAR10 is considered small (50000 samples in the train set * 3 channels * 32 height * 32 width ~= 150MB) it’s loaded directly in the __init__.
If you are concerned about wasting these 150MB, you could derive a custom dataset using CIFAR10 as the parent class and split the datasets internally.
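
As an alternative to subclassing CIFAR10 (this is a different technique from the one described above, not a sketch of it), the data can be loaded once without any transform and the transform applied in a thin wrapper around each split. `TransformedSubset` is a hypothetical helper name, and the `TensorDataset` is a toy stand-in for a CIFAR10 object created with `transform=None`.

```python
import torch
from torch.utils.data import Dataset, Subset, TensorDataset


class TransformedSubset(Dataset):
    """Hypothetical wrapper: applies `transform` to the input of each (x, y) sample."""

    def __init__(self, subset, transform=None):
        self.subset = subset
        self.transform = transform

    def __len__(self):
        return len(self.subset)

    def __getitem__(self, idx):
        x, y = self.subset[idx]
        if self.transform is not None:
            x = self.transform(x)
        return x, y


# Toy base dataset standing in for CIFAR10 loaded once with transform=None.
base = TensorDataset(torch.arange(10, dtype=torch.float32), torch.arange(10))

g = torch.Generator().manual_seed(0)
perm = torch.randperm(len(base), generator=g).tolist()

# Both splits share `base`, so the underlying data is held in memory only once.
train_data = TransformedSubset(Subset(base, perm[:8]), transform=lambda x: -x)
valid_data = TransformedSubset(Subset(base, perm[8:]), transform=None)
```

With the real CIFAR10 the untransformed samples are PIL images, so in that setting the validation wrapper would still need at least `ToTensor` (and usually the same `Normalize`) rather than `None`.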

AM.MO · April 3, 2022, 8:45am

What should I do if I want to use only 500 samples for training and 500 for validation? I tried different ways, and it still gives me this error:

ValueError: Sum of input lengths does not equal the length of the input dataset!

ptrblck · April 3, 2022, 9:23pm

random_split expects to assign all samples of the dataset to the splits, so the lengths you pass must sum to len(dataset).
If you want to use a subset only, use torch.utils.data.Subset or a SubsetRandomSampler and pass the indices to them.
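
A minimal sketch of both options, assuming a toy `TensorDataset` of 5000 samples in place of CIFAR10 and an arbitrary batch size of 50. The indices come from a single shuffled permutation, so the two groups of 500 do not overlap.

```python
import torch
from torch.utils.data import DataLoader, Subset, SubsetRandomSampler, TensorDataset

# Toy dataset standing in for the 50000-sample CIFAR10 training set.
dataset = TensorDataset(torch.randn(5000, 3), torch.zeros(5000, dtype=torch.long))

# Shuffle all indices once, then take two disjoint slices of 500 each.
g = torch.Generator().manual_seed(0)
perm = torch.randperm(len(dataset), generator=g).tolist()
train_idx, valid_idx = perm[:500], perm[500:1000]

# Option 1: wrap the dataset in Subset objects.
train_data = Subset(dataset, train_idx)
valid_data = Subset(dataset, valid_idx)

# Option 2: keep the full dataset and restrict the DataLoader via a sampler.
train_loader = DataLoader(dataset, batch_size=50,
                          sampler=SubsetRandomSampler(train_idx))
valid_loader = DataLoader(dataset, batch_size=50,
                          sampler=SubsetRandomSampler(valid_idx))
```

Another way to satisfy random_split is to pass a third length for the leftover samples, e.g. `[500, 500, len(dataset) - 1000]`, and ignore the third split. Note that the sampler option reads from one dataset object and therefore uses one transform for both loaders.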