I have a simple dataset – a torch tensor of size [N, D+1]: N samples, each of size D+1, denoting an input-output pair. I am passing this tensor to torch.utils.data.DataLoader() to sample rows for minibatch SGD optimization (shuffle=True in the current setting).
My question is – by default, how does DataLoader sample from a dataset? Is it with replacement across batches, or without?
The documentation gives these details for specific samplers such as torch.utils.data.RandomSampler(), but the default sampler argument for torch.utils.data.DataLoader() is None. (torch.utils.data – PyTorch documentation)
For example, with N=5 and batch_size=2 for the DataLoader, can the sequence of samples in an epoch be (1,2), (1,3), (4) – i.e. sample 1 drawn repeatedly, if sampling with replacement is the default – or would it be something like (1,2), (4,5), (3)?
From the RandomSampler documentation: “Samples elements randomly. If without replacement, then sample from a shuffled dataset. If with replacement, then user can specify num_samples to draw.”
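One way to settle this empirically is to iterate a small DataLoader and record which rows come back in each epoch. A minimal sketch (the tensor values double as row IDs purely for tracking; the sizes are toy values):

```python
import torch
from torch.utils.data import DataLoader

N, D = 5, 3
# toy tensor of shape [N, D+1]: row i is filled with the value i,
# so the first column tells us which row each batch element came from
data = torch.arange(N, dtype=torch.float32).unsqueeze(1).repeat(1, D + 1)

loader = DataLoader(data, batch_size=2, shuffle=True)

for epoch in range(3):
    seen = []
    for batch in loader:
        seen.extend(batch[:, 0].int().tolist())
    print(f"epoch {epoch}: rows {seen}")
    # every row appears exactly once per epoch => sampling without replacement
    assert sorted(seen) == list(range(N))
```

Each epoch prints a different permutation of the five row IDs, never a repeat within an epoch.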
Hi @theory_buff,
Firstly, since you are using shuffle=True, you do not need to specify any value for the sampler argument of your DataLoader – the two options are mutually exclusive.
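In fact, passing both at once is rejected outright: DataLoader raises a ValueError if it receives shuffle=True together with an explicit sampler. A small sketch (toy tensor, arbitrary sizes):

```python
import torch
from torch.utils.data import DataLoader, RandomSampler

data = torch.randn(5, 4)  # toy tensor: 5 samples of size D+1=4

# shuffle=True together with an explicit sampler is rejected
try:
    DataLoader(data, batch_size=2, shuffle=True, sampler=RandomSampler(data))
except ValueError as e:
    print("ValueError:", e)
```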
Check this out to get an idea of how the data is actually sampled across DataLoader worker processes when shuffle=True.
And so, in a shuffle=True setting, I don’t think there is any duplication of samples across batches.
Why do I think so?
With shuffle=True, the DataLoader internally constructs a RandomSampler with replacement=False, which iterates over a random permutation of the indices – so each index is drawn exactly once per epoch. On top of that, the index batches produced by the sampler are split across the worker processes, so no two workers fetch the same sample.
Additionally, see this (scroll down a bit) to get an idea of how duplication can happen when we use DataPipes instead of a Dataset, unless sharding_filter() is applied. That said, I’ll restate my point: since the default sampler draws without replacement and the work is sharded across worker processes, I don’t think there should be duplication “by default”.
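To sanity-check the single-process case as well: with shuffle=True, the DataLoader exposes the sampler it built via its sampler attribute, and one pass over that sampler yields a permutation of the indices, i.e. no duplicates within an epoch. A minimal sketch (toy tensor):

```python
import torch
from torch.utils.data import DataLoader

data = torch.randn(5, 4)  # toy tensor: 5 samples of size D+1=4
loader = DataLoader(data, batch_size=2, shuffle=True)

# shuffle=True makes the DataLoader build a RandomSampler internally,
# with replacement=False by default
print(type(loader.sampler).__name__)   # RandomSampler
print(loader.sampler.replacement)      # False

# one pass over the sampler yields a permutation of range(len(data)),
# so every index appears exactly once per epoch
indices = list(loader.sampler)
print(sorted(indices))                 # [0, 1, 2, 3, 4]
```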
Feel free to point me to any resources that state otherwise.