Split Dataset into 10 equal parts

Hey everyone,

I am still a PyTorch noob. I want to do incremental learning and want to split my training dataset (CIFAR-10) into 10 equal parts (or 5, 12, 20, …), each part with the same target distribution.

I already tried to do it with sklearn (train_test_split), but it can only split the data into two parts:

import numpy as np
import torch
import torchvision
import torchvision.transforms as transforms
from sklearn.model_selection import train_test_split

transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)

targets = trainset.targets  # class labels used for the stratified split

data1_idx, data2_idx = train_test_split(
    np.arange(len(targets)),
    test_size=0.5,
    shuffle=True,
    stratify=targets)

data1_sampler = torch.utils.data.SubsetRandomSampler(data1_idx)
data2_sampler = torch.utils.data.SubsetRandomSampler(data2_idx)

data1_loader = torch.utils.data.DataLoader(trainset, batch_size=4, sampler=data1_sampler)
data2_loader = torch.utils.data.DataLoader(trainset, batch_size=4, sampler=data2_sampler)

How would you do it in PyTorch? Maybe you can point me to some example code.

I think `sklearn.model_selection.StratifiedKFold` might be useful, as it allows you to create multiple splits in a stratified fashion.
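As a minimal sketch (assuming the `trainset` from the snippet above): the test folds of a 10-fold `StratifiedKFold` form ten equal, class-balanced index sets:

import numpy as np
from sklearn.model_selection import StratifiedKFold

# The test fold of each of the 10 splits is one equal, stratified part.
# skf.split only uses X for its length here, so a dummy zeros array suffices.
targets = np.array(trainset.targets)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
part_indices = [part_idx for _, part_idx in skf.split(np.zeros(len(targets)), targets)]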


Are there any plans to implement StratifiedKFold in PyTorch?

No, I don’t think there are plans to reimplement these scikit-learn methods, as they wouldn’t benefit from Autograd and are already easily available and well tested. What’s your use case, or what potential advantage do you see in copying them?

I agree with you.
I was just imagining code fully implemented in PyTorch lol.
Thanks for your comment.

Thank you for answering the question! I will try this.

I have a follow-up question: is using 10 DataLoaders the best way to do this? It results in a lot of copy-paste code; maybe there is a cleaner way?

Also a second follow-up question: if I don’t want the same distribution for each part (so not stratified) but a random distribution instead, that is still easy to do (just change the `stratify` argument of the sklearn function).
But what if I want to create my own distributions? For example, data part 1 consisting of 50% of class 3, 10% of class 7, and 5% of each of the remaining 8 classes; data part 2 consisting of 15% of class 4, 15% of class 5, 15% of class 9, 15% of class 10, and 5% of each of the remaining 6 classes; data part 3 consisting of 30% of class 9, 20% of … and so on. I think you know what I mean.
Is there a way to create this detailed data split in PyTorch or sklearn? I guess the best way to do this data split/preparation is not in PyTorch but with NumPy, Pandas, vanilla Python, whatever, and then load the 10 data parts into PyTorch. Can you confirm this?

Maybe appending the loaders to a list would be cleaner and avoid the code duplication.
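Something like this sketch, for instance (assuming `trainset` and the `part_indices` list from the `StratifiedKFold` example above):

import torch

# One loader per part, collected in a list instead of 10 separate variables.
loaders = [
    torch.utils.data.DataLoader(
        trainset,
        batch_size=4,
        sampler=torch.utils.data.SubsetRandomSampler(idx))
    for idx in part_indices
]

# Incremental learning: visit the parts one after another.
for part_id, loader in enumerate(loaders):
    for images, labels in loader:
        pass  # train on the current part here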

Yes, I would claim the easiest approach would be to reuse an already implemented method from one of the mentioned packages. One approach could be to create a WeightedRandomSampler in PyTorch using the desired class distribution for each part and create the 10 loaders with them.
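As a rough sketch of that idea for your part-1 example (assuming the `trainset` from above; the part size of 5000 is just a placeholder): weight each sample by its class’s desired probability divided by that class’s sample count, so the sampler draws approximately the requested distribution. Note that this samples with replacement, so the parts approximate the distributions rather than partitioning the dataset:

import numpy as np
import torch

targets = np.array(trainset.targets)

# Desired distribution for part 1: 50% class 3, 10% class 7, 5% for the rest.
desired = np.full(10, 0.05)
desired[3], desired[7] = 0.5, 0.1

# Per-sample weight = desired class probability / number of samples in that class.
class_counts = np.bincount(targets, minlength=10)
sample_weights = desired[targets] / class_counts[targets]

part1_sampler = torch.utils.data.WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=5000,  # placeholder part size
    replacement=True)
part1_loader = torch.utils.data.DataLoader(trainset, batch_size=4, sampler=part1_sampler)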