Difference between using random sampler at Dataset vs DataLoader

Hi,
I’m trying to understand the difference between a Dataset and DataLoader for a specific case. The DataLoader allows us to specify a sampler.

Let's say I have a sampler that looks like the following.

from torch.utils.data import Sampler

class MyAwesomeSampler(Sampler):
    def __init__(self, indices):
        # Fixed list of dataset indices to visit, in this exact order
        self.indices = indices

    def __iter__(self):
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)

and then I pass an instance of MyAwesomeSampler to the DataLoader through its sampler argument.

How is this different from using Subset(mydataset, indices), where the indices are the same as above?
Does this make any difference for Dataset objects with data augmentation (e.g. any of the image datasets in torchvision)?


Your custom sampler and the Subset will yield the same data samples in the same order, as long as you don’t use shuffle=True for the DataLoader that wraps the Subset.
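Here is a minimal sketch of that equivalence. ToyDataset and the index list are made up for illustration (item i is just the value i), and MyAwesomeSampler is copied from the question:

```python
import torch
from torch.utils.data import DataLoader, Dataset, Sampler, Subset


class ToyDataset(Dataset):
    """Hypothetical dataset for illustration: item i is the tensor value i."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return torch.tensor(idx)


class MyAwesomeSampler(Sampler):
    def __init__(self, indices):
        self.indices = indices

    def __iter__(self):
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)


dataset = ToyDataset(10)
indices = [7, 2, 5, 0]  # arbitrary example indices

# Option 1: full dataset, the sampler decides which indices are visited.
sampler_loader = DataLoader(dataset, batch_size=2, sampler=MyAwesomeSampler(indices))

# Option 2: Subset remaps positions 0..3 to the chosen indices; no shuffling.
subset_loader = DataLoader(Subset(dataset, indices), batch_size=2, shuffle=False)

from_sampler = torch.cat(list(sampler_loader)).tolist()
from_subset = torch.cat(list(subset_loader)).tolist()

print(from_sampler)  # [7, 2, 5, 0]
print(from_subset)   # [7, 2, 5, 0]
```

One practical difference that remains: len(subset_loader.dataset) is 4, while len(sampler_loader.dataset) is still 10, because the sampler leaves the dataset object untouched.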

The main difference is that a custom sampler can do much more than just return a subset of the data.
E.g. the WeightedRandomSampler draws indices with a probability proportional to a per-sample weight, which can be used to balance the batches for an imbalanced dataset.
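A rough sketch of how that could look. The 90/10 class split, the random features, and the inverse-frequency weighting are assumptions chosen for illustration, not the only way to set the weights:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Hypothetical imbalanced data: 90 samples of class 0, 10 samples of class 1.
labels = torch.cat([torch.zeros(90, dtype=torch.long),
                    torch.ones(10, dtype=torch.long)])
features = torch.randn(100, 3)
dataset = TensorDataset(features, labels)

# One weight per sample: the inverse frequency of that sample's class.
class_counts = torch.bincount(labels)
sample_weights = 1.0 / class_counts[labels].float()

# Draw with replacement so minority samples can appear several times per epoch.
sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(dataset),
                                replacement=True)
loader = DataLoader(dataset, batch_size=20, sampler=sampler)

# Collect the labels seen in one epoch and measure the minority share.
drawn = torch.cat([y for _, y in loader])
minority_fraction = drawn.float().mean().item()
print(f"fraction of class 1 in one epoch: {minority_fraction:.2f}")
```

With these weights both classes are drawn with roughly equal probability, even though class 1 makes up only 10% of the underlying data. A Subset cannot express this, since it only fixes which indices exist.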

It shouldn’t make any difference regarding data augmentation: the sampler or the Subset only changes which indices are passed to __getitem__, not the method itself, so any transforms defined in the Dataset are applied in the same way.
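To check this yourself, here is a small sketch with a made-up dataset whose __getitem__ adds random noise as a stand-in for a real augmentation (AugmentedToyDataset and its call counter are hypothetical, not part of torchvision):

```python
import torch
from torch.utils.data import DataLoader, Dataset, Subset


class AugmentedToyDataset(Dataset):
    """Hypothetical dataset that applies a random 'augmentation' per access."""

    def __init__(self, n):
        self.data = torch.arange(n, dtype=torch.float32)
        self.calls = 0  # counts __getitem__ calls (only valid with num_workers=0)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        self.calls += 1
        noise = 0.1 * torch.rand(())  # stand-in for a random transform
        return self.data[idx] + noise


dataset = AugmentedToyDataset(10)
indices = [7, 2, 5, 0]  # arbitrary example indices

# DataLoader accepts any iterable with __len__ as a sampler, so a list works.
via_sampler = torch.cat(list(DataLoader(dataset, batch_size=2, sampler=indices)))
via_subset = torch.cat(list(DataLoader(Subset(dataset, indices), batch_size=2)))

print(via_sampler)
print(via_subset)
print(dataset.calls)
```

Both loaders go through the same __getitem__, so each of the 8 accesses gets its own freshly drawn noise. The values differ between the two runs only because the augmentation is random, not because of the sampler or the Subset.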


Thank you for clarifying!