# How does DataLoader choose a batch?

Greetings!

I came across the (apparently not so trivial?) question of how to subsample without replacement, i.e. from a set {1, 2, …, N} choose k < N elements without replacement, where each integer has the same probability.

I was wondering how the Torch `DataLoader` handles this. There seem to be algorithms of order N and others of order k.

One naive idea is to shuffle the whole number array and then choose the first k elements to be the samples (order N).
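For illustration, a minimal Python sketch of that naive approach (names chosen just for this example):

```python
import random

n, k = 1000, 10
pool = list(range(n))   # the full set {0, ..., n-1}
random.shuffle(pool)    # shuffles all n elements, so order n
sample = pool[:k]       # keep the first k shuffled elements
```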
Then there is the Fisher-Yates algorithm:

• choose a random int in the set
• swap the last int in the set with the sampled int
• repeat for the penultimate int in the set
• continue until one has k sampled ints in the last k positions of the set

This is of order k.
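A minimal sketch of that partial Fisher-Yates shuffle in Python might look like this (the function name and details are my own, not from any library):

```python
import random

def sample_without_replacement(n, k):
    """Draw k distinct integers from {0, ..., n-1} uniformly at random
    via a partial Fisher-Yates shuffle: only k swaps are performed."""
    pool = list(range(n))              # note: building this is still O(n)
    for i in range(n - 1, n - 1 - k, -1):
        j = random.randint(0, i)       # pick from the not-yet-fixed prefix
        pool[i], pool[j] = pool[j], pool[i]
    return pool[n - k:]                # the last k slots hold the sample
```

Each of the k iterations does O(1) work, which is where the order-k claim comes from (setting aside the cost of building the pool).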

Hi Physics!

I haven’t verified this, but I believe that with `shuffle=True`, the default
`DataLoader` shuffles the (indices for the) entire `Dataset`, and then returns
the samples in the `Dataset` one by one or batch by batch. There is no
“optimization” for sampling only `k` of the total `N` samples in the `Dataset`
because the expectation is that you will iterate over the entire `Dataset`
(and, in fact, do so multiple times as you train over multiple epochs).

Your naive idea of shuffling the whole array and taking the first `k` elements
is, in essence, what I believe `DataLoader` does, for example, in the case
where the `Dataset` contains `N` samples, but you choose to sample a single
batch containing only `k` samples.
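A quick way to see this behavior (a toy sketch, not the internals themselves):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(1000))  # N = 1000 samples

# shuffle=True permutes the indices of the entire Dataset up front;
# taking a single batch of k samples is then just the first k entries
# of that permutation
loader = DataLoader(dataset, batch_size=32, shuffle=True)
(first_batch,) = next(iter(loader))
print(first_batch)  # 32 distinct values drawn uniformly from 0..999
```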

Note, as a quibble, that forming the initial set of `N` integers to which you
apply this algorithm is itself an order-`N` operation.

Also, as a practical matter, the difference between order-`k` and order-`N`
will be fully irrelevant compared to the real work of processing the samples
in any realistic use case.

Best.

K. Frank

PyTorch has map-style datasets (ones with `__getitem__()` and `__len__()`) and iterable-style datasets (which implement `__iter__()` instead). By a torch dataset, we generally mean a map-style dataset, as that is the most commonly used type.
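As a quick illustration of the two styles (toy classes of my own, not from the library):

```python
from torch.utils.data import Dataset, IterableDataset

class MapStyleSquares(Dataset):
    """Map-style: random access via __getitem__ plus a known __len__."""
    def __len__(self):
        return 10

    def __getitem__(self, idx):
        return idx ** 2

class IterableSquares(IterableDataset):
    """Iterable-style: only sequential access via __iter__."""
    def __iter__(self):
        return (i ** 2 for i in range(10))
```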

You can go and read this part in the docs: Data Loading Order and Sampler. I will highlight the important part:

> A sequential or shuffled sampler will be automatically constructed based on the `shuffle` argument to a `DataLoader`. Alternatively, users may use the `sampler` argument to specify a custom `Sampler` object that at each time yields the next index/key to fetch.
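For completeness, the `sampler` argument is also the hook you would use if you genuinely wanted to visit only `k` of the `N` samples per epoch. A rough sketch, where `SubsetRandomKSampler` is a class name I made up for this example:

```python
import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset

class SubsetRandomKSampler(Sampler):
    """Yields k distinct indices out of n, freshly drawn each epoch."""
    def __init__(self, n, k, generator=None):
        self.n, self.k, self.generator = n, k, generator

    def __iter__(self):
        # torch.randperm is O(n); its first k entries are a uniform
        # sample without replacement
        perm = torch.randperm(self.n, generator=self.generator)
        return iter(perm[: self.k].tolist())

    def __len__(self):
        return self.k

dataset = TensorDataset(torch.arange(1000))
loader = DataLoader(dataset, batch_size=10,
                    sampler=SubsetRandomKSampler(n=1000, k=50))
# iterating over loader now visits only 50 of the 1000 samples
```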

We can write our custom samplers if needed, but you are asking about the default sampler, I believe. Depending on whether `shuffle` is `False` or `True`, the underlying sampler is either a `SequentialSampler` or a `RandomSampler`. You can check this inside `DataLoader`’s source code very quickly. I’ll again highlight the relevant part:

```python
if sampler is None:  # give default samplers
    if self._dataset_kind == _DatasetKind.Iterable:
        # See NOTE [ Custom Samplers and IterableDataset ]
        sampler = _InfiniteConstantSampler()
    else:  # map-style
        if shuffle:
            sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
        else:
            sampler = SequentialSampler(dataset)
```

You can now go check `RandomSampler`’s source code to see exactly what is going on. Since it calls `torch.randperm(n)`, where `n` is the length of the dataset, it is apparent that the whole index array is shuffled first (or rather, a randomly ordered index array is constructed).
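In other words, simplified (this mirrors the idea, not the exact source):

```python
import torch

n, k = 10, 4
perm = torch.randperm(n)  # O(n): a random permutation of 0..n-1
print(perm)               # e.g. tensor([3, 7, 0, 9, 5, 1, 8, 2, 6, 4])
print(perm[:k])           # the first k entries form a sample
                          # without replacement
```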
