Does the DataLoader just shuffle the order of batches, or does it also shuffle the images within each batch?

Sorry for asking this basic question, but I think I was always under the impression that the DataLoader's shuffle just reorders the batches without changing the order of the images. So for example, my batch size is 2 and my images are: 0, 1, 2, 3, 4, 5, 6, 7

If I call the DataLoader with shuffle set to True, I get the following batches: [0, 1], [2, 3], [4, 5], [6, 7], and then the order of these batches is changed, so in the end I could get something like: [2, 3], [0, 1], [6, 7], [4, 5]

Is this how shuffle works in the DataLoader, or is the order of the images changed entirely (e.g., the images are reordered to 3, 4, 7, 0, 1, 5, 2, 6) and then split into batches?

Hi,

It shuffles all the images first and then builds batches from that shuffled order, so the images within each batch are shuffled as well.
Can you give me a small code sample that reproduces what you observe, please?
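
For example, a quick check with the toy setup above (a minimal sketch using TensorDataset) shows the original pairs are broken up:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(8))
loader = DataLoader(dataset, batch_size=2, shuffle=True)

for (batch,) in loader:
    print(batch)  # e.g. tensor([3, 6]), tensor([0, 5]), ... -- pairs like [0, 1] are not kept together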

I guess I had the wrong assumption then. I thought that the images were left untouched and only the batches were shuffled around. Also, what if I use a random sampler and then call the DataLoader with that sampler and shuffle set to False? Will I get the behavior where the images are not shuffled but the batches are?

Hi,

I don’t think we have a fixed-batch random sampler.
It should be “easy” to do: make your dataset of size real_size / batch_size, have it return a whole batch when asked for a single index, and use a regular random sampler with a batch size of 1 for the DataLoader.
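
For instance, a minimal sketch of this idea (the wrapper class name BatchChunkDataset is made up for illustration):

import torch
from torch.utils.data import Dataset, DataLoader

class BatchChunkDataset(Dataset):
    # Each item is a whole pre-built batch; assumes len(data) is divisible by batch_size.
    def __init__(self, data, batch_size):
        self.data = data
        self.batch_size = batch_size

    def __len__(self):
        return len(self.data) // self.batch_size

    def __getitem__(self, index):
        start = index * self.batch_size
        return self.data[start:start + self.batch_size]

data = torch.arange(8)
loader = DataLoader(BatchChunkDataset(data, batch_size=2), batch_size=1, shuffle=True)

for batch in loader:
    print(batch.squeeze(0))  # e.g. tensor([4, 5]), tensor([0, 1]), ... -- order inside each pair is intact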

Then why do some people use the PyTorch RandomSampler with the DataLoader instead of just setting the shuffle argument to True in the DataLoader?

I don’t think there is any difference between the two.
If you have some code reference, I can double check.
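
For instance, these two loaders should behave the same way (a minimal sketch):

import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

dataset = TensorDataset(torch.arange(8))

# shuffle=True internally constructs a RandomSampler, so these two are equivalent
loader_a = DataLoader(dataset, batch_size=2, shuffle=True)
loader_b = DataLoader(dataset, batch_size=2, sampler=RandomSampler(dataset))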

Thank you for your reply and for clearing the confusion. Actually, this is where I noticed this:

I have to admit, I’m not sure why it is done this way in this code. Maybe it’s for historical reasons; he may have been trying other samplers before.
You can double-check the docs here to confirm what the behavior will be with shuffle or if you set the sampler.

Hello! I’m quite new here, but I think my question is related to this topic.
When shuffle is True, does it just build batches with random indices, or does it actually shuffle my dataset?
If it shuffles the entire dataset, is my labels tensor shuffled the same way?
I know it may be a stupid question, but as I am building my labels tensor inside my custom dataset class, I don’t know if shuffle can break the data-label correspondence I’m creating.
Thank you!

If shuffle=True is set in the DataLoader, a RandomSampler will be used as seen in these lines of code.
This sampler will create random indices and pass them to the Dataset.__getitem__ method as seen here.

Your data and target correspondence should thus hold, since the same index should be used to load these tensors.
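
A minimal sketch illustrating this (the class name PairDataset is made up for illustration):

import torch
from torch.utils.data import Dataset, DataLoader

class PairDataset(Dataset):
    def __init__(self):
        self.data = torch.arange(8).float()
        self.target = torch.arange(8)  # target i belongs to data i

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # the same index loads both tensors, so the pairing survives shuffling
        return self.data[index], self.target[index]

loader = DataLoader(PairDataset(), batch_size=2, shuffle=True)
for x, y in loader:
    assert torch.equal(x.long(), y)  # correspondence holds in every shuffled batch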

I was interested in your suggestion, but I do not know how to create such a dataset of size real_size / batch_size.

The idea is to have an extra dimension.
In particular, if you use a TensorDataset, you want to change your Tensor from real_size, ... to real_size / batch_size, batch_size, ... and ask for a batch size of 1 from the DataLoader. That way you will get one batch of size batch_size every time. Note that you get an input of size 1, batch_size, ..., which you might want to reshape to remove the leading 1.
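
A minimal sketch of this reshaping trick:

import torch
from torch.utils.data import DataLoader, TensorDataset

batch_size = 2
data = torch.arange(8).view(-1, batch_size)  # shape [4, 2], i.e. real_size / batch_size x batch_size
loader = DataLoader(TensorDataset(data), batch_size=1, shuffle=True)

for (batch,) in loader:
    batch = batch.squeeze(0)  # drop the leading 1 added by the DataLoader
    print(batch)              # e.g. tensor([6, 7]) -- each original pair stays intact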

Is there a way to access/return the random indices passed to the Dataset.__getitem__ method?
Thank you

Yes, you can directly return them with the data:

def __getitem__(self, index):
    x = self.data[index]
    y = self.target[index]
    return x, y, index  # the sampled index is returned alongside the data

I know it is an old post, but I came across the exact same problem.
Can you give an example of how to do what you suggested?
I’m working on CIFAR, for example, and I want to have a certain order of images such that the DataLoader will only shuffle between batches and not inside a batch.
Thanks
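
For reference, one possible sketch of the earlier suggestion applied to CIFAR10 (the wrapper class FixedBatchDataset is made up for illustration, not a built-in):

import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, transforms

class FixedBatchDataset(Dataset):
    # Groups an underlying dataset into fixed batches; only the batch order gets shuffled.
    def __init__(self, base, batch_size):
        self.base = base
        self.batch_size = batch_size

    def __len__(self):
        return len(self.base) // self.batch_size

    def __getitem__(self, index):
        start = index * self.batch_size
        items = [self.base[i] for i in range(start, start + self.batch_size)]
        images = torch.stack([img for img, _ in items])
        labels = torch.tensor([label for _, label in items])
        return images, labels

cifar = datasets.CIFAR10(root="./data", train=True, download=True,
                         transform=transforms.ToTensor())
loader = DataLoader(FixedBatchDataset(cifar, batch_size=8), batch_size=1, shuffle=True)

for images, labels in loader:
    images, labels = images.squeeze(0), labels.squeeze(0)  # drop the leading 1
    # images: [8, 3, 32, 32], labels: [8]; the order inside each batch is the original dataset order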
