How does the DataLoader choose a batch?

Greetings!

I came across the (apparently not-so-trivial?) question of how to subsample without replacement, i.e., from a set {1, 2, …, N}, choose k < N elements without replacement, where each integer has the same probability of being chosen.

I was wondering how the PyTorch DataLoader handles this. There seem to be algorithms of order N and others of order k.

One naive idea is to shuffle the whole number array and then choose the first k elements to be the samples (order N).
Then there is the Fisher-Yates algorithm:

  • choose a random int from the set
  • swap the last int in the set with the sampled int
  • repeat for the penultimate position, drawing from the remaining ints
  • continue until the last k positions in the set hold the k sampled ints

This is of order k.
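
For concreteness, here is a minimal Python sketch of that partial Fisher-Yates sampling (the function name is my own; this is not something PyTorch provides):

    import random

    def sample_without_replacement(pool, k):
        # Partial Fisher-Yates: only k swap steps, so the sampling itself is O(k).
        # (Building the initial pool of N ints is of course still O(N).)
        n = len(pool)
        for i in range(n - 1, n - 1 - k, -1):
            j = random.randint(0, i)        # pick among the positions not yet fixed
            pool[i], pool[j] = pool[j], pool[i]
        return pool[n - k:]                 # the last k slots hold the sample

    print(sample_without_replacement(list(range(10)), 3))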

Hi Physics!

I haven’t verified this, but I believe that with shuffle = True, the default
DataLoader shuffles the (indices for the) entire Dataset and then returns
the samples in the Dataset one by one or batch by batch. There is no
“optimization” for sampling only k of the total N samples in the Dataset,
because the expectation is that you will iterate over the entire Dataset
(and, in fact, do so multiple times as you train over multiple epochs).

That naive shuffle-the-whole-array approach is, in essence, what I believe
the DataLoader does even in the case where the Dataset contains N samples
but you choose to sample a single batch containing only k of them.
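
As a toy illustration of that scenario (the sizes are made up; this just shows the mechanics):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    N, k = 1000, 32                        # made-up sizes
    dataset = TensorDataset(torch.arange(N))

    loader = DataLoader(dataset, batch_size=k, shuffle=True)
    (batch,) = next(iter(loader))          # k distinct samples out of N
    # Under the hood, a full length-N permutation of indices was generated,
    # even though only its first k entries were consumed.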

Note, as a quibble, that forming the initial set of N integers to which you
apply your order-k algorithm is already an order-N operation.

Also, as a practical matter, the difference between order-k and order-N
will be fully irrelevant compared to the real work of processing the samples
in any realistic use case.

Best.

K. Frank

PyTorch has map-style datasets (ones that implement __getitem__() and __len__()) and iterable-style datasets (ones that implement __iter__() instead). By a torch dataset, we generally mean a map-style dataset, as that is the most commonly used type.
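
A minimal sketch of the two styles (class names invented for the example):

    from torch.utils.data import Dataset, IterableDataset

    class SquaresMap(Dataset):
        # map-style: random access via __getitem__ plus a known __len__
        def __len__(self):
            return 10
        def __getitem__(self, i):
            return i * i

    class SquaresIterable(IterableDataset):
        # iterable-style: only __iter__; no random access, no indexing
        def __iter__(self):
            return (i * i for i in range(10))

Samplers, discussed below, produce indices, so they pair with map-style datasets.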

You can go and read this part in the docs: Data Loading Order and Sampler. I will highlight the important part:

A sequential or shuffled sampler will be automatically constructed based on the shuffle argument to a DataLoader. Alternatively, users may use the sampler argument to specify a custom Sampler object that at each time yields the next index/key to fetch.

We can write our own custom samplers if needed, but you are asking about the default sampler, I believe. Depending on whether shuffle=False or shuffle=True, the underlying sampler is either a SequentialSampler or a RandomSampler. You can check this very quickly inside DataLoader’s source code. I’ll again highlight the relevant part:

        if sampler is None:  # give default samplers
            if self._dataset_kind == _DatasetKind.Iterable:
                # See NOTE [ Custom Samplers and IterableDataset ]
                sampler = _InfiniteConstantSampler()
            else:  # map-style
                if shuffle:
                    sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
                else:
                    sampler = SequentialSampler(dataset)
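
As an aside, if one actually wanted the k-of-N behavior from the original question, the sampler argument from the quoted docs is the natural hook. A sketch, assuming a map-style dataset (the class name and behavior are mine, not a built-in):

    import torch
    from torch.utils.data import Sampler

    class RandomSubsetSampler(Sampler):
        # Yields k distinct indices out of n, uniformly, fresh each epoch.
        # Like RandomSampler, it still pays O(N) for torch.randperm.
        def __init__(self, n, k):
            self.n, self.k = n, k

        def __iter__(self):
            yield from torch.randperm(self.n)[: self.k].tolist()

        def __len__(self):
            return self.k

You would pass it as DataLoader(dataset, batch_size=..., sampler=RandomSubsetSampler(len(dataset), k)) and leave shuffle at its default of False, since sampler and shuffle are mutually exclusive.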

You can now check RandomSampler’s source code to see exactly what is going on. Since there is a call to torch.randperm(n), where n is the length of the dataset, it is apparent that the whole number array is shuffled first (or rather, a randomly ordered index array is constructed up front).
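
A quick check of that behavior (the output order will vary, of course):

    import torch
    from torch.utils.data import RandomSampler, TensorDataset

    dataset = TensorDataset(torch.arange(5))
    print(list(RandomSampler(dataset)))    # e.g. [3, 0, 4, 1, 2]: a full permutation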
