Sampling from a PyTorch dataset

I would like to randomly sample from a PyTorch dataset and create a new PyTorch dataset from the samples.
Currently, I am using data.random_split, but I'm not sure if this is the best approach.
This is the code:

from torch.utils import data

def generate_random_coreset(dataset, num_samples):
    # dataset is a PyTorch Dataset; random_split returns two Subsets:
    # the remainder and the num_samples-sized random coreset
    return data.random_split(dataset, [len(dataset) - num_samples, num_samples])

Does PyTorch offer a function better suited to this?
I would appreciate any guidance and hints on this question.

Could you please describe your use case in more detail? Also, please check out torch.utils.data.DataLoader, which allows automatic batched loading of data, possibly using parallel worker processes.
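For example, a minimal sketch (the batch_size and num_workers values are illustrative, and dataset is assumed to be a map-style Dataset):

from torch.utils.data import DataLoader

# Batches are assembled automatically; num_workers > 0 loads them
# in parallel worker processes
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)
for batch in loader:
    pass  # training / evaluation step goes here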

Thank you for the reply.
My use case is creating a coreset (or memory) from the task dataset in a continual learning setting. I would like to do the sampling before using the dataloaders.
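Something like this is what I have in mind (a rough sketch; torch.utils.data.Subset with torch.randperm is just one possible way to materialize the samples as a Dataset):

import torch
from torch.utils.data import Subset

# Draw num_samples unique indices and wrap them in a Subset,
# which is itself a Dataset that can be passed to a DataLoader
indices = torch.randperm(len(dataset))[:num_samples]
coreset = Subset(dataset, indices.tolist())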

You can use torch.utils.data.RandomSampler to create a random sampler for your DataLoader.
E.g.

from torch.utils import data

# k is the desired batch size
random_sampler = data.RandomSampler(dataset, num_samples=num_samples)
dataloader = data.DataLoader(dataset, batch_size=k, sampler=random_sampler)

Note that you can pass a generator object to RandomSampler to get the same subset of random samples every time.
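For instance (a minimal sketch; the seed value is arbitrary):

import torch
from torch.utils import data

# A seeded generator makes the sampler draw the same indices on every run
g = torch.Generator()
g.manual_seed(42)
random_sampler = data.RandomSampler(dataset, num_samples=num_samples, generator=g)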
