I have a simple dataset – a torch tensor of size [N, D+1]: N samples, each of size D+1, denoting an input-output pair. I am passing this tensor to torch.utils.data.DataLoader() to sample rows for minibatch SGD optimization (shuffle=True in the current setting).
My question is – by default, how does DataLoader sample from a dataset? Is it with replacement across batches, or without?
The documentation gives these details for specific samplers such as torch.utils.data.RandomSampler(), but the default sampler argument for torch.utils.data.DataLoader() is None. (torch.utils.data – PyTorch documentation)
For example, with N=5 and batch_size=2 for the DataLoader, can the sequence of samples in an epoch be (1,2), (1,3), (4) – i.e. sample 1 drawn repeatedly, if sampling with replacement is the default – or would it be something like (1,2), (4,5), (3)?
From the RandomSampler documentation: “Samples elements randomly. If without replacement, then sample from a shuffled dataset. If with replacement, then user can specify num_samples to draw.”
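One way to settle this empirically is to iterate a small DataLoader and record which rows come back in each epoch. A minimal sketch (the tensor values double as row IDs purely for tracking; the sizes are toy values):

```python
import torch
from torch.utils.data import DataLoader

N, D = 5, 3
# toy tensor of shape [N, D+1]: row i is filled with the value i,
# so the first column tells us which row each batch element came from
data = torch.arange(N, dtype=torch.float32).unsqueeze(1).repeat(1, D + 1)

loader = DataLoader(data, batch_size=2, shuffle=True)

for epoch in range(3):
    seen = []
    for batch in loader:
        seen.extend(batch[:, 0].int().tolist())
    print(f"epoch {epoch}: rows {seen}")
    # every row appears exactly once per epoch => sampling without replacement
    assert sorted(seen) == list(range(N))
```

Each epoch prints a different permutation of the five row IDs, never a repeat within an epoch.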
Hi @theory_buff,
Firstly, since you are using shuffle=True, you do not need to specify any value for the sampler argument of your DataLoader – the two options are mutually exclusive.
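In fact, passing both at once is rejected outright: DataLoader raises a ValueError if it receives shuffle=True together with an explicit sampler. A small sketch (toy tensor, arbitrary sizes):

```python
import torch
from torch.utils.data import DataLoader, RandomSampler

data = torch.randn(5, 4)  # toy tensor: 5 samples of size D+1=4

# shuffle=True together with an explicit sampler is rejected
try:
    DataLoader(data, batch_size=2, shuffle=True, sampler=RandomSampler(data))
except ValueError as e:
    print("ValueError:", e)
```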
Check this out to get an idea of how the data is actually sampled across DataLoader worker processes when shuffle=True.
And so, in a shuffle=True setting, I don’t think there is any duplication of samples across batches.
Why do I think so?
With shuffle=True, the DataLoader internally constructs a RandomSampler with replacement=False, which iterates over a random permutation of the indices – so each index is drawn exactly once per epoch. On top of that, the index batches produced by the sampler are split across the worker processes, so no two workers fetch the same sample.
Additionally, see this (scroll down a bit) to get an idea of how duplication can happen when we use DataPipes instead of a Dataset, unless sharding_filter() is applied. That said, I’ll restate my point: since the default sampler draws without replacement and the work is sharded across worker processes, I don’t think there should be duplication “by default”.
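To sanity-check the single-process case as well: with shuffle=True, the DataLoader exposes the sampler it built via its sampler attribute, and one pass over that sampler yields a permutation of the indices, i.e. no duplicates within an epoch. A minimal sketch (toy tensor):

```python
import torch
from torch.utils.data import DataLoader

data = torch.randn(5, 4)  # toy tensor: 5 samples of size D+1=4
loader = DataLoader(data, batch_size=2, shuffle=True)

# shuffle=True makes the DataLoader build a RandomSampler internally,
# with replacement=False by default
print(type(loader.sampler).__name__)   # RandomSampler
print(loader.sampler.replacement)      # False

# one pass over the sampler yields a permutation of range(len(data)),
# so every index appears exactly once per epoch
indices = list(loader.sampler)
print(sorted(indices))                 # [0, 1, 2, 3, 4]
```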
Feel free to point me to any resources that state otherwise.