We have a problem with the shuffling behaviour of the DataLoader. It seems that the DataLoader reshuffles the whole dataset and forms new batches at the beginning of every epoch.
However, we are performing semi-supervised training and have to make sure that the same batches of images are sent to the model in every epoch.
For example, let's say our batches are as follows:
Batch 1 consists of images [a,b,c,…]
Batch 2 consists of images [f,g,h,…]
Batch n consists of images [x,y,z,…]
So after the first epoch we need exactly the same batches in all later epochs as well, because in every epoch we use the first image of each batch (a, f, … and x in the example above). The model also needs the other images, so we cannot drop them, but we specifically need those first images to stay the same.
Our training method requires that we shuffle the data once at the very beginning, form batches from that shuffled data, and use exactly the same batches for the rest of the training.
At first, using a DataLoader did not seem to cause any problems, but as mentioned above, we have seen that new batches are formed at each epoch. We have also tried SubsetRandomSampler, but it did not solve the problem.
Try passing shuffle=False as a parameter to the DataLoader.
Hope this helps.
If you would still like one initial shuffling you could maybe shuffle your dataset at the start somehow or make a custom sampler:
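import torch
from torch.utils.data.sampler import Sampler
from typing import Iterator, Sized

class ShuffleOnceSampler(Sampler):  # hypothetical class name
    def __init__(self, data_source: Sized) -> None:
        self.data_source = data_source
        self.num_samples = len(self.data_source)
        generator = torch.Generator()
        # Draw the permutation once; every epoch replays this same order.
        self.shuffled_list = torch.randperm(self.num_samples, generator=generator).tolist()

    def __iter__(self) -> Iterator[int]:
        yield from self.shuffled_list

    def __len__(self) -> int:
        return self.num_samples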
Thank you so much for your responses :))
However, I am not familiar enough with the concept of samplers. Can you suggest a way to shuffle the dataset once at the beginning?
Maybe then we can disable the shuffle option of the dataloader and obtain what we want.
I have the same problem in my project. I want to shuffle the dataset just once, at the beginning of training. Although I use SubsetRandomSampler, the dataset is reshuffled every epoch. From what I found online, the DataLoader treats every complete pass over the dataset as one epoch. Do you have a solution for this issue? Thank you.
Use a DataLoader with shuffle=False and shuffle your data manually before putting it into the dataset.
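A minimal sketch of that suggestion, assuming the data fits in memory as tensors. The toy features and labels and all variable names are made up for illustration; only torch.randperm, TensorDataset and DataLoader are real PyTorch APIs.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data (hypothetical): 10 samples, one feature each, labels equal to the index.
features = torch.arange(10, dtype=torch.float32).unsqueeze(1)
labels = torch.arange(10)

# Shuffle once, applying the same permutation to features and labels.
perm = torch.randperm(len(features))
dataset = TensorDataset(features[perm], labels[perm])

# shuffle=False keeps the order fixed, so batches are identical in every epoch.
loader = DataLoader(dataset, batch_size=4, shuffle=False)

# Collect the label batches of three epochs to compare them.
epoch_batches = [
    [batch_labels.tolist() for _, batch_labels in loader]
    for _ in range(3)
]
```

Each entry of epoch_batches should hold the same list of batches, since the permutation is drawn only once before the dataset is built.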
Thank you so much for all of your responses, and yes, we did it. Apparently scikit-learn has a utility named shuffle (sklearn.utils.shuffle). After importing it, we created a new variable holding the shuffled version of the data. Passing that to the DataLoader and disabling shuffling solved our problem.
Pass the provided sampler to the DataLoader via the sampler= argument.
See docs at
The sampler solution should be more efficient than using scikit-learn.
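For anyone unfamiliar with samplers, here is a rough, self-contained sketch of how a shuffle-once sampler plugs into the DataLoader. The class name FixedPermutationSampler, the seed parameter and the toy dataset are assumptions for illustration, not part of PyTorch.

```python
import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset


class FixedPermutationSampler(Sampler):  # hypothetical name
    """Draws one random permutation at construction and replays it every epoch."""

    def __init__(self, data_source, seed=0):
        # The seed parameter is an assumption added here for reproducibility.
        generator = torch.Generator().manual_seed(seed)
        self.indices = torch.randperm(len(data_source), generator=generator).tolist()

    def __iter__(self):
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)


dataset = TensorDataset(torch.arange(10))  # toy dataset: sample i holds the value i
sampler = FixedPermutationSampler(dataset)
# sampler= cannot be combined with shuffle=True, so shuffle is left at its default.
loader = DataLoader(dataset, batch_size=4, sampler=sampler)

first_epoch = [batch[0].tolist() for batch in loader]
second_epoch = [batch[0].tolist() for batch in loader]
```

Because the permutation lives in the sampler rather than in the data, nothing is copied or reordered in memory; the DataLoader just asks the sampler for the same index order each epoch.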
Another simple solution: if you want to get a shuffled Dataset, you can use Subset instead of scikit-learn.
ShuffledDataset = torch.utils.data.Subset(YourDataset, torch.randperm(len(YourDataset)).tolist())
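A small sketch of the Subset approach with a toy dataset, to check that the batches really stay the same across epochs. The dataset and variable names are hypothetical.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

base_dataset = TensorDataset(torch.arange(10))  # hypothetical toy dataset

# One permutation, drawn once; .tolist() gives Subset plain Python ints.
indices = torch.randperm(len(base_dataset)).tolist()
shuffled_dataset = Subset(base_dataset, indices)

# shuffle=False so the DataLoader walks the Subset in its fixed order.
loader = DataLoader(shuffled_dataset, batch_size=4, shuffle=False)

# Iterate two epochs and record the values seen in each batch.
epochs = [[batch[0].tolist() for batch in loader] for _ in range(2)]
```

Both epochs should yield identical batches, in the order given by indices. Subset only stores the index list, so the underlying dataset is not copied.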