Does the DataLoader just shuffle the order of batches, or does it also shuffle the images within each batch?

Sorry for asking this basic question, but I think I was always under the impression that the DataLoader's shuffle just reorders the batches without changing the order of the images. So for example, my batch size is 2 and my images are: 0, 1, 2, 3, 4, 5, 6, 7

If I call the DataLoader with shuffle set to True, I get the following batches: [0, 1], [2, 3], [4, 5], [6, 7], and then the order of these batches is changed, so in the end I could get something like: [2, 3], [0, 1], [6, 7], [4, 5]

Is this how shuffle works in the DataLoader, or is the order of the images changed entirely (e.g., the images are reordered to 3, 4, 7, 0, 1, 5, 2, 6) and then split into batches?

Hi,

It shuffles all the images first and then builds batches from that shuffled order, so the images within each batch are shuffled as well.
Can you give me a small code sample that reproduces what you observe, please?
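
For example, a quick check with the toy setup above (a minimal sketch using TensorDataset) shows the original pairs are broken up:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(8))
loader = DataLoader(dataset, batch_size=2, shuffle=True)

for (batch,) in loader:
    print(batch)  # e.g. tensor([3, 6]), tensor([0, 5]), ... -- pairs like [0, 1] are not kept together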

I guess I had the wrong assumption then. I thought that the images were left untouched and only the batches were shuffled around. Also, what if I use a random sampler and then call the DataLoader with that sampler and shuffle set to False? Will I get the behavior where the images are not shuffled but the batches are?

Hi,

I don’t think we have a fixed-batch random sampler.
It should be “easy” to do: make your dataset of size real_size / batch_size, have it return a whole batch when asked for a single index, and use a regular random sampler with a batch size of 1 for the DataLoader.
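
For instance, a minimal sketch of this idea (the wrapper class name BatchChunkDataset is made up for illustration):

import torch
from torch.utils.data import Dataset, DataLoader

class BatchChunkDataset(Dataset):
    # Each item is a whole pre-built batch; assumes len(data) is divisible by batch_size.
    def __init__(self, data, batch_size):
        self.data = data
        self.batch_size = batch_size

    def __len__(self):
        return len(self.data) // self.batch_size

    def __getitem__(self, index):
        start = index * self.batch_size
        return self.data[start:start + self.batch_size]

data = torch.arange(8)
loader = DataLoader(BatchChunkDataset(data, batch_size=2), batch_size=1, shuffle=True)

for batch in loader:
    print(batch.squeeze(0))  # e.g. tensor([4, 5]), tensor([0, 1]), ... -- order inside each pair is intact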

Then why do some people use the PyTorch RandomSampler with the DataLoader instead of just setting the shuffle argument to True in the DataLoader?

I don’t think there is any difference between the two.
If you have some code reference, I can double check.
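
For instance, these two loaders should behave the same way (a minimal sketch):

import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

dataset = TensorDataset(torch.arange(8))

# shuffle=True internally constructs a RandomSampler, so these two are equivalent
loader_a = DataLoader(dataset, batch_size=2, shuffle=True)
loader_b = DataLoader(dataset, batch_size=2, sampler=RandomSampler(dataset))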

Thank you for your reply and for clearing the confusion. Actually, this is where I noticed this:

I have to admit, I’m not sure why it is done this way in this code. Maybe it’s for historical reasons; he may have been trying other samplers before.
You can double-check the docs here to confirm what the behavior will be with shuffle or if you set the sampler.

Hello! I’m quite new here, but I think my question is related to this topic.
When shuffle is True, does it just build batches with random indices, or does it actually shuffle my dataset?
If it shuffles the entire dataset, is my labels tensor shuffled the same way?
I know it may be a stupid question, but as I am building my labels tensor inside my custom dataset class, I don’t know if shuffle can break the data-label correspondence I’m creating.
Thank you!

If shuffle=True is set in the DataLoader, a RandomSampler will be used as seen in these lines of code.
This sampler will create random indices and pass them to the Dataset.__getitem__ method as seen here.

Your data and target correspondence should thus hold, since the same index should be used to load these tensors.
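
A minimal sketch illustrating this (the class name PairDataset is made up for illustration):

import torch
from torch.utils.data import Dataset, DataLoader

class PairDataset(Dataset):
    def __init__(self):
        self.data = torch.arange(8).float()
        self.target = torch.arange(8)  # target i belongs to data i

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # the same index loads both tensors, so the pairing survives shuffling
        return self.data[index], self.target[index]

loader = DataLoader(PairDataset(), batch_size=2, shuffle=True)
for x, y in loader:
    assert torch.equal(x.long(), y)  # correspondence holds in every shuffled batch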

I was interested in your suggestion, but I do not know how to create such a dataset of size real_size / batch_size.

The idea is to have an extra dimension.
In particular, if you use a TensorDataset, you want to change your Tensor from real_size, ... to real_size / batch_size, batch_size, ... and ask for a batch size of 1 from the DataLoader. That way you will get one batch of size batch_size every time. Note that you get an input of size 1, batch_size, ..., which you might want to reshape to remove the leading 1.
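
A minimal sketch of this reshaping trick:

import torch
from torch.utils.data import DataLoader, TensorDataset

batch_size = 2
data = torch.arange(8).view(-1, batch_size)  # shape [4, 2], i.e. real_size / batch_size x batch_size
loader = DataLoader(TensorDataset(data), batch_size=1, shuffle=True)

for (batch,) in loader:
    batch = batch.squeeze(0)  # drop the leading 1 added by the DataLoader
    print(batch)              # e.g. tensor([6, 7]) -- each original pair stays intact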

Is there a way to access/return the random indices passed to the Dataset.__getitem__ method?
Thank you

Yes, you can directly return them with the data:

def __getitem__(self, index):
    x = self.data[index]
    y = self.target[index]
    return x, y, index  # the sampled index is returned alongside the data

I know it is an old post, but I came across the exact same problem.
Can you give an example of how to do what you suggested?
I’m working on CIFAR, for example, and I want to have a certain order of images such that the DataLoader will only shuffle between batches and not inside a batch.
Thanks
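
For reference, one possible sketch of the earlier suggestion applied to CIFAR10 (the wrapper class FixedBatchDataset is made up for illustration, not a built-in):

import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, transforms

class FixedBatchDataset(Dataset):
    # Groups an underlying dataset into fixed batches; only the batch order gets shuffled.
    def __init__(self, base, batch_size):
        self.base = base
        self.batch_size = batch_size

    def __len__(self):
        return len(self.base) // self.batch_size

    def __getitem__(self, index):
        start = index * self.batch_size
        items = [self.base[i] for i in range(start, start + self.batch_size)]
        images = torch.stack([img for img, _ in items])
        labels = torch.tensor([label for _, label in items])
        return images, labels

cifar = datasets.CIFAR10(root="./data", train=True, download=True,
                         transform=transforms.ToTensor())
loader = DataLoader(FixedBatchDataset(cifar, batch_size=8), batch_size=1, shuffle=True)

for images, labels in loader:
    images, labels = images.squeeze(0), labels.squeeze(0)  # drop the leading 1
    # images: [8, 3, 32, 32], labels: [8]; the order inside each batch is the original dataset order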
