Dataloader shuffles at every epoch

akgulozlem · October 25, 2021, 1:04pm

Hello everyone,

We are having trouble with the shuffling behavior of the DataLoader. It seems to reshuffle the whole dataset and form new batches at the beginning of every epoch.

However, we are performing semi-supervised training, and we have to make sure that the same images are sent to the model at every epoch.
For example, let's say our batches are as follows:
Batch 1 consists of images [a,b,c,…]
Batch 2 consists of images [f,g,h,…]

Batch n consists of images [x,y,z,…]

So after the first epoch, we need exactly the same batches in every later epoch, because in each epoch we use the first image of every batch (a, f, … and x in the example above). The model still needs the other images, so we cannot drop them, but we also need to be able to retrieve those specific first images every time.

Our training method requires that we shuffle the data once at the very beginning, form batches from that shuffled data, and use exactly the same batches for the rest of the training.

At first, using a DataLoader did not seem to cause any problems, but as mentioned above, we found that new batches are formed at each epoch. We also tried SubsetRandomSampler, without success.

Programmer-RD-AI · October 25, 2021, 2:16pm

Try passing shuffle=False as a parameter to the DataLoader: DataLoader(dataset, shuffle=False)

Hope this helps

ZimoNitrome · October 25, 2021, 2:21pm

If you would still like one initial shuffling you could maybe shuffle your dataset at the start somehow or make a custom sampler:

import torch
from torch.utils.data.sampler import Sampler
from typing import Iterator, Sized

class ConstantRandomSampler(Sampler[int]):
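    def __init__(self, data_source: Sized) -> None:
        # Keep a reference to the dataset so its length can be queried.
        self.data_source = data_source
        self.num_samples = len(self.data_source)
        generator = torch.Generator()

        # Draw the permutation once here, so every epoch replays the same order.
        self.shuffled_list = torch.randperm(self.num_samples, generator=generator).tolist()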

    def __iter__(self) -> Iterator[int]:
        yield from self.shuffled_list

    def __len__(self) -> int:
        return self.num_samples

akgulozlem · October 25, 2021, 5:55pm

Thank you so much for your responses :))
However, I am not familiar enough with the concept of samplers, can you propose a way to shuffle the dataset in the beginning?
Maybe then we can disable the shuffle option of the dataloader and obtain what we want.

Suy · October 25, 2021, 5:59pm

I have the same problem in my project. I want to shuffle the dataset just once, at the beginning of training. Although I use SubsetRandomSampler, the dataset is reshuffled every epoch. From what I found online, the DataLoader treats every complete pass over the dataset as one epoch. Do you have a solution for this issue? Thank you.
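
The reshuffling described here can be reproduced without a DataLoader. The sketch below assumes that each pass over a SubsetRandomSampler corresponds to one epoch and that the sampler draws a new permutation on every pass. The seed and the index range are arbitrary choices for illustration.

```python
import torch
from torch.utils.data import SubsetRandomSampler

# A seeded generator makes the run reproducible; the seed value is arbitrary.
generator = torch.Generator().manual_seed(0)
sampler = SubsetRandomSampler(list(range(100)), generator=generator)

# Each pass over the sampler plays the role of one epoch.
first_epoch = [int(i) for i in sampler]
second_epoch = [int(i) for i in sampler]

# Both epochs cover the same indices, but in a different order.
print(sorted(first_epoch) == sorted(second_epoch))
print(first_epoch == second_epoch)
```

So SubsetRandomSampler restricts which indices are drawn, but it does not fix their order across epochs.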

my3bikaht · October 25, 2021, 7:22pm

Use a DataLoader with shuffle=False and shuffle your data manually before adding it to the dataset.
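
A minimal sketch of this suggestion, assuming the data fits in memory as tensors. The toy images and labels, the batch size, and the seed are hypothetical placeholders, not values from the thread.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 10 samples with 1 feature each, plus labels.
images = torch.arange(10, dtype=torch.float32).unsqueeze(1)
labels = torch.arange(10)

# Shuffle once, applying the same permutation to images and labels.
generator = torch.Generator().manual_seed(0)  # arbitrary seed
perm = torch.randperm(len(images), generator=generator)
dataset = TensorDataset(images[perm], labels[perm])

# shuffle=False keeps the stored order, so the batches repeat every epoch.
loader = DataLoader(dataset, batch_size=4, shuffle=False)

# Collect the label batches of two consecutive epochs and compare them.
epochs = [[batch_labels.tolist() for _, batch_labels in loader] for _ in range(2)]
print(epochs[0] == epochs[1])
```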

akgulozlem · October 26, 2021, 2:01pm

Hello everyone,
Thank you so much for all of your responses, and yes, we did it. Apparently scikit-learn has a utility named shuffle; after importing it, we created a new variable holding the shuffled version of the data. Passing that to the DataLoader and disabling shuffling solved our problem.

Ren_Pang · October 26, 2021, 4:27pm

Pass the provided sampler to the DataLoader constructor via the sampler= argument.

The sampler solution should be more efficient than using scikit-learn.
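
A minimal sketch of passing a sampler through sampler=. FixedOrderSampler is a hypothetical stand-in that draws one permutation at construction and replays it. The toy dataset, batch size, and seed are assumptions for illustration.

```python
import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset


class FixedOrderSampler(Sampler):
    """Hypothetical sampler: draws one permutation and replays it every epoch."""

    def __init__(self, data_source, seed=0):
        generator = torch.Generator().manual_seed(seed)  # arbitrary seed
        self.order = torch.randperm(len(data_source), generator=generator).tolist()

    def __iter__(self):
        return iter(self.order)

    def __len__(self):
        return len(self.order)


dataset = TensorDataset(torch.arange(10))  # toy dataset of 10 items

# sampler= cannot be combined with shuffle=True, so shuffle is left at its default.
loader = DataLoader(dataset, batch_size=4, sampler=FixedOrderSampler(dataset))

first = [batch[0].tolist() for batch in loader]
second = [batch[0].tolist() for batch in loader]
print(first == second)
```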

Another simple solution

If you want to get a shuffled Dataset, you can use Subset instead of scikit-learn.

ShuffledDataset = torch.utils.data.Subset(YourDataset, torch.randperm(len(YourDataset)))
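
The Subset approach can be sketched as follows, again with a hypothetical toy dataset and an arbitrary seed. Converting the permutation to a plain list with tolist() is a choice made here so that the wrapped dataset is indexed with ordinary Python integers.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

dataset = TensorDataset(torch.arange(10))  # hypothetical toy dataset

# Draw the permutation once; Subset then exposes the data in that fixed order.
generator = torch.Generator().manual_seed(0)  # arbitrary seed
indices = torch.randperm(len(dataset), generator=generator).tolist()
shuffled_dataset = Subset(dataset, indices)

# With shuffle=False the loader walks the Subset in the same order every epoch.
loader = DataLoader(shuffled_dataset, batch_size=4, shuffle=False)

first = [batch[0].tolist() for batch in loader]
second = [batch[0].tolist() for batch in loader]
print(first == second)
```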