Dataloader shuffles at every epoch

Hello everyone,

We are having a problem with the shuffling behavior of the DataLoader. It seems that the DataLoader reshuffles the whole dataset and forms new batches at the beginning of every epoch.

However, we are doing semi-supervised training and have to make sure that the same images are sent to the model at every epoch.
For example, let’s say our batches are as follows:
Batch 1 consists of images [a,b,c,…]
Batch 2 consists of images [f,g,h,…]

Batch n consists of images [x,y,z,…]

So after the first epoch we need exactly the same batches in every later epoch, because at every epoch we use the first image of each batch (a, f, … and x in the example above). The model still needs the other images, so we cannot drop them, but we also need to get these specific first images every time.

Our training method requires that we shuffle the data once at the very beginning, form batches from that shuffled data, and use exactly the same batches for the rest of the training.

Using a DataLoader causes no problems in the first epoch, but as I mentioned above, new batches are formed at each epoch. We also tried SubsetRandomSampler, without success.


Try passing shuffle=False as a parameter to the DataLoader: DataLoader(dataset, shuffle=False)

Hope this helps

If you would still like one initial shuffling you could maybe shuffle your dataset at the start somehow or make a custom sampler:

import torch
from torch.utils.data.sampler import Sampler
from typing import Iterator, Sized

class ConstantRandomSampler(Sampler[int]):
    def __init__(self, data_source: Sized) -> None:
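        # Number of samples to permute, taken from the dataset passed in.
        self.num_samples = len(data_source)
        generator = torch.Generator()
        # Draw the permutation once so that every epoch replays the same order.
        self.shuffled_list = torch.randperm(self.num_samples, generator=generator).tolist()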

    def __iter__(self) -> Iterator[int]:
        yield from self.shuffled_list

    def __len__(self) -> int:
        return self.num_samples


Thank you so much for your responses :))
However, I am not familiar enough with the concept of samplers, can you propose a way to shuffle the dataset in the beginning?
Maybe then we can disable the shuffle option of the dataloader and obtain what we want.

I have the same problem in my project. I want to shuffle the dataset just once, at the beginning of training. Although I use SubsetRandomSampler, the dataset is reshuffled every epoch. From what I found online, the DataLoader treats every complete pass over the dataset as one epoch. Do you have a solution for this issue? Thank you :)
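For what it's worth, the reshuffling seen with SubsetRandomSampler can be reproduced with a small sketch like the one below. It assumes a toy TensorDataset of 100 integers (not the data from this thread), and `epoch_order` is just a helper name made up for the illustration. The sampler draws a fresh permutation each time the loader is iterated, so two passes almost certainly differ.

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Toy dataset of 100 items whose values equal their indices (illustrative only).
dataset = TensorDataset(torch.arange(100))
sampler = SubsetRandomSampler(list(range(100)))
loader = DataLoader(dataset, batch_size=10, sampler=sampler)

def epoch_order(loader):
    # Flatten all batches of one pass over the loader into a list of ints.
    return [int(v) for (batch,) in loader for v in batch]

first = epoch_order(loader)   # order seen in "epoch 1"
second = epoch_order(loader)  # order seen in "epoch 2"

# Both passes cover every sample, but in a newly drawn random order.
print(first == second)
```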

Use a DataLoader with shuffle=False and shuffle your data manually before putting it into the dataset.
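A minimal sketch of that idea, assuming the data fits in memory as tensors (the `features` and `labels` tensors here are made-up stand-ins, not anyone's real data): permute once with torch.randperm, build the dataset from the permuted tensors, and leave shuffle off.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data: 100 samples with 3 features and integer labels.
features = torch.randn(100, 3)
labels = torch.arange(100)

# Shuffle once, applying the same permutation to features and labels.
perm = torch.randperm(len(features))
dataset = TensorDataset(features[perm], labels[perm])

# shuffle=False keeps that order, so the batches are identical in every epoch.
loader = DataLoader(dataset, batch_size=10, shuffle=False)

# Record the labels of each batch over three epochs to compare them.
epochs = [[y.tolist() for _, y in loader] for _ in range(3)]
print(epochs[0] == epochs[1] == epochs[2])
```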

Hello everyone,
Thank you so much for all of your responses, and yes, we did it. Apparently scikit-learn has a utility named shuffle (sklearn.utils.shuffle) :) After importing it, we created a new variable holding the shuffled version of the data. Passing that to the DataLoader and disabling shuffling solved our problem.

Pass the sampler provided above as the sampler= argument of the DataLoader constructor.

See docs at
https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader

Sampler docs:
https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler

The sampler solution should be more efficient than using scikit-learn.
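Wiring a fixed-order sampler into the DataLoader could look roughly like this. It is only a sketch: the toy TensorDataset is an assumption for illustration, and the sampler is a compact restatement of the ConstantRandomSampler idea from earlier in the thread. Note that shuffle is left at its default, since it cannot be combined with a custom sampler.

```python
import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset

class ConstantRandomSampler(Sampler[int]):
    """Draws one random permutation at construction and replays it every epoch."""

    def __init__(self, data_source):
        self.shuffled_list = torch.randperm(len(data_source)).tolist()

    def __iter__(self):
        yield from self.shuffled_list

    def __len__(self):
        return len(self.shuffled_list)

# Toy dataset (illustrative only): values equal their indices.
dataset = TensorDataset(torch.arange(100))
sampler = ConstantRandomSampler(dataset)

# Pass the sampler via sampler=; do not also set shuffle=True.
loader = DataLoader(dataset, batch_size=10, sampler=sampler)

epoch_1 = [batch.tolist() for (batch,) in loader]
epoch_2 = [batch.tolist() for (batch,) in loader]
print(epoch_1 == epoch_2)
```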

Another simple solution

If you want a shuffled Dataset, you can use Subset instead of scikit-learn.

ShuffledDataset = torch.utils.data.Subset(YourDataset, torch.randperm(len(YourDataset)).tolist())
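As a quick check of the Subset approach, here is a small sketch on a toy TensorDataset (an assumption for illustration; swap in your own dataset). The indices are converted to a plain list of ints, which should work regardless of how the underlying dataset handles indexing.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Toy dataset (illustrative only): values equal their indices.
base = TensorDataset(torch.arange(100))

# One permutation, drawn once, defines the order for all epochs.
indices = torch.randperm(len(base)).tolist()
shuffled = Subset(base, indices)

# With shuffle=False the loader simply walks the Subset in index order.
loader = DataLoader(shuffled, batch_size=10, shuffle=False)

first_epoch = [batch.tolist() for (batch,) in loader]
second_epoch = [batch.tolist() for (batch,) in loader]
print(first_epoch == second_epoch)
```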