Split Dataset into 10 equal parts

Hey everyone,

I am still a PyTorch noob. I want to do incremental learning and want to split my training dataset (CIFAR-10) into 10 equal parts (or 5, 12, 20, …), each part with the same target distribution.

I already tried to do it with sklearn (train_test_split), but it can only split the data into two parts:

import numpy as np
import torch
import torchvision
import torchvision.transforms as transforms
from sklearn.model_selection import train_test_split

transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)

targets = trainset.targets  # class labels used for the stratified split

data1_idx, data2_idx = train_test_split(
    np.arange(len(targets)),
    test_size=0.5,
    shuffle=True,
    stratify=targets)

data1_sampler = torch.utils.data.SubsetRandomSampler(data1_idx)
data2_sampler = torch.utils.data.SubsetRandomSampler(data2_idx)

data1_loader = torch.utils.data.DataLoader(trainset, batch_size=4, sampler=data1_sampler)
data2_loader = torch.utils.data.DataLoader(trainset, batch_size=4, sampler=data2_sampler)

How would you do it in PyTorch? Maybe you can point me to some example code.

I think `sklearn.model_selection.StratifiedKFold` might be useful, as it allows you to create multiple splits in a stratified fashion.
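As a minimal sketch (assuming the `trainset` from the snippet above): the test folds of a 10-fold `StratifiedKFold` form ten equal, class-balanced index sets:

import numpy as np
from sklearn.model_selection import StratifiedKFold

# The test fold of each of the 10 splits is one equal, stratified part.
# skf.split only uses X for its length here, so a dummy zeros array suffices.
targets = np.array(trainset.targets)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
part_indices = [part_idx for _, part_idx in skf.split(np.zeros(len(targets)), targets)]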


Are there any plans to implement StratifiedKFold in PyTorch?

No, I don’t think there are plans to reimplement these scikit-learn methods, as they wouldn’t benefit from Autograd and are already easily available and well tested. What’s your use case, or what potential advantage do you see in copying them?

I agree with you.
I was just imagining code fully implemented in PyTorch lol.
Thanks for your comment.

Thank you for answering the question! I will try this.

I have a follow-up question: is using 10 DataLoaders the best way to do this? It results in a lot of copy-paste code; maybe there is a cleaner way?

Also a second follow-up question: if I don’t want the same distribution for each part (so not stratified) but a random distribution instead, that is still easy to do (just change the `stratify` argument of the sklearn function).
But what if I want to create my own distributions? For example, data part 1 consisting of 50% of class 3, 10% of class 7, and 5% of each of the remaining 8 classes; data part 2 consisting of 15% of class 4, 15% of class 5, 15% of class 9, 15% of class 10, and 5% of each of the remaining 6 classes; data part 3 consisting of 30% of class 9, 20% of … and so on. I think you know what I mean.
Is there a way to create this detailed data split in PyTorch or sklearn? I guess the best way to do this data split/preparation is not in PyTorch but with NumPy, Pandas, vanilla Python, whatever, and then load the 10 data parts into PyTorch. Can you confirm this?

Maybe appending the loaders to a list would be cleaner and avoid the code duplication.
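Something like this sketch, for instance (assuming `trainset` and the `part_indices` list from the `StratifiedKFold` example above):

import torch

# One loader per part, collected in a list instead of 10 separate variables.
loaders = [
    torch.utils.data.DataLoader(
        trainset,
        batch_size=4,
        sampler=torch.utils.data.SubsetRandomSampler(idx))
    for idx in part_indices
]

# Incremental learning: visit the parts one after another.
for part_id, loader in enumerate(loaders):
    for images, labels in loader:
        pass  # train on the current part here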

Yes, I would claim the easiest approach would be to reuse an already implemented method from one of the mentioned packages. One approach could be to create a WeightedRandomSampler in PyTorch using the desired class distribution for each part and create the 10 loaders with them.
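As a rough sketch of that idea for your part-1 example (assuming the `trainset` from above; the part size of 5000 is just a placeholder): weight each sample by its class’s desired probability divided by that class’s sample count, so the sampler draws approximately the requested distribution. Note that this samples with replacement, so the parts approximate the distributions rather than partitioning the dataset:

import numpy as np
import torch

targets = np.array(trainset.targets)

# Desired distribution for part 1: 50% class 3, 10% class 7, 5% for the rest.
desired = np.full(10, 0.05)
desired[3], desired[7] = 0.5, 0.1

# Per-sample weight = desired class probability / number of samples in that class.
class_counts = np.bincount(targets, minlength=10)
sample_weights = desired[targets] / class_counts[targets]

part1_sampler = torch.utils.data.WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=5000,  # placeholder part size
    replacement=True)
part1_loader = torch.utils.data.DataLoader(trainset, batch_size=4, sampler=part1_sampler)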