Creating subset of dataset using a metric?

I am trying to write a PyTorch implementation of the following paper,

This paper uses various metrics to prune a dataset to a smaller size. Is there an elegant way to achieve this in PyTorch? I am currently using the CIFAR10 and CIFAR100 datasets for this experiment, so I am looking for something that works with them, but I would also love to learn about options for other kinds of datasets.

In general, the bottleneck is that the datasets can be very large.

I think you’d need to be more precise about which of their experiments you want to implement.
For example, the finetuning one from section 6 would involve taking an ImageNet-pretrained model (those are available in TorchVision or TIMM), computing embeddings, and then clustering the embeddings (using Scikit Learn or so).
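As a rough sketch of that embed-then-cluster workflow (not the paper's exact procedure): the encoder below is a random linear layer and the images are random tensors, both stand-ins chosen so the snippet runs without downloads. In practice you would swap in a pretrained TorchVision or TIMM model and the real dataset. The names `compute_embeddings` and `dist_to_centroid` are made up for this example, and using the distance to the nearest k-means centroid as a per-sample score is an assumption about how the clusters might be used.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.cluster import KMeans

torch.manual_seed(0)

# Stand-in data with CIFAR-like shapes (assumption: replace with the real dataset).
images = torch.randn(200, 3, 32, 32)
labels = torch.randint(0, 10, (200,))
dataset = TensorDataset(images, labels)

# Stand-in encoder (assumption: replace with a pretrained model whose
# classification head has been removed).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64)).eval()


@torch.no_grad()
def compute_embeddings(model, ds, batch_size=64):
    """Embed the dataset batch by batch so it never has to fit in memory at once."""
    loader = DataLoader(ds, batch_size=batch_size, shuffle=False)
    chunks = [model(x) for x, _ in loader]
    return torch.cat(chunks).numpy()


embeddings = compute_embeddings(encoder, dataset)

# Cluster the embeddings; 10 clusters is an arbitrary choice for this sketch.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(embeddings)

# One score per sample: distance to its nearest cluster centroid.
dist_to_centroid = kmeans.transform(embeddings).min(axis=1)
```

Because `shuffle=False`, entry `i` of `dist_to_centroid` lines up with sample `i` of the dataset, so the scores can be used directly to pick indices.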

Best regards


I want to implement the pruning of the CIFAR10 dataset to 50% of its original size. To be exact, the experiment mentioned in section B of the supplementary PDF.

But I don’t understand why the answer to the question depends on the experiment. Basically, if I have a PyTorch dataset, I want to know how I can generate a subset of it based on some formula that is specific to each datapoint.

In section B I see:

CIFAR-10 and SVHN. CIFAR-10 and SVHN model training was performed using a standard
ResNet-18 through the PyTorch library. Each model was trained …

Data pruning for transfer learning. To assess the effect of pruning downstream finetuning data
on transfer learning performance, vision transformers (ViTs) pre-trained on ImageNet21k were
fine-tuned on different pruned subsets of CIFAR-10. Pre-trained models were obtained from …

which does not mention how the CIFAR10/100 datasets were pruned.

You could use torch.utils.data.Subset and calculate the indices based on the targets from the dataset and your pruning logic.
For CIFAR10 you can directly access the targets via dataset.targets.
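A minimal sketch of that idea, assuming you already have one score per sample from whatever metric you choose. `ToyDataset` is a stand-in that exposes a `.targets` list the way the torchvision CIFAR10 dataset does, `prune_indices` is a hypothetical helper, and the random scores are placeholders. Keeping the highest-scoring samples per class is also an assumption; flip the selection if your metric should keep the lowest ones.

```python
import torch
from torch.utils.data import Dataset, Subset


class ToyDataset(Dataset):
    """Stand-in dataset exposing .targets, like torchvision's CIFAR10."""

    def __init__(self, n=100, num_classes=10):
        self.data = torch.randn(n, 3, 32, 32)
        self.targets = [i % num_classes for i in range(n)]

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        return self.data[idx], self.targets[idx]


def prune_indices(scores, targets, keep_fraction=0.5):
    """Return sorted indices of the top-scoring keep_fraction of each class."""
    scores = torch.as_tensor(scores, dtype=torch.float)
    targets = torch.as_tensor(targets)
    keep = []
    for c in targets.unique():
        class_idx = (targets == c).nonzero(as_tuple=True)[0]
        n_keep = max(1, int(keep_fraction * len(class_idx)))
        top = scores[class_idx].topk(n_keep).indices
        keep.append(class_idx[top])
    return torch.cat(keep).sort().values.tolist()


torch.manual_seed(0)
dataset = ToyDataset()
scores = torch.rand(len(dataset))  # placeholder for your per-sample metric

indices = prune_indices(scores, dataset.targets, keep_fraction=0.5)
pruned = Subset(dataset, indices)  # behaves like a normal Dataset
```

Subset only stores the index list and forwards lookups to the original dataset, so no data is copied, and `pruned` can be passed to a DataLoader as usual. Selecting per class keeps the class balance of the original data; drop the loop and call `topk` on all scores if that is not wanted.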