How can I use an iterable dataset and dataloader for ML training?

Hi everyone,
I have a huge dataset (we’re talking about trillions and trillions of samples here). The samples are read from multiple files and then streamed into the dataloaders. For my implementation, I’ve used an iterable dataset and then wrapped it in a DataLoader.

From my understanding, if you use an iterable dataset, that means you need to take care of the shuffling yourself? And the data needs to be reshuffled in each epoch to avoid overfitting.

I’m confused about how I should initialize the dataset and dataloader before training. Do I create the dataset and dataloader again in each epoch? How does the sequence go? Also, do I need a dataset shuffler to shuffle the data in that case?
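
For reference, here’s a simplified sketch of what I have right now (the file names and the parsing logic are just placeholders):

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

class StreamingDataset(IterableDataset):
    """Streams samples from a list of files, sharding the files across workers."""

    def __init__(self, file_paths):
        self.file_paths = file_paths

    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is None:
            files = self.file_paths  # single-process loading: iterate over every file
        else:
            # split the files across workers so samples aren't yielded twice
            files = self.file_paths[worker_info.id::worker_info.num_workers]
        for path in files:
            with open(path) as f:
                for line in f:
                    yield self.parse(line)

    def parse(self, line):
        # placeholder: turn one raw text line into a tensor
        return torch.tensor([float(x) for x in line.split()])

dataset = StreamingDataset(["shard_0.txt", "shard_1.txt"])
loader = DataLoader(dataset, batch_size=32, num_workers=2)  # shuffle=True is not allowed with an IterableDataset
```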

Could you describe why you are using an iterable dataset instead of a map-style one?

From my research (I could be wrong), if the data doesn’t fit in memory, then you use an iterable-style dataset.

Not necessarily, since the standard approach is to lazily load the data anyway. I.e. in a map-style dataset you would store e.g. the paths to all images (without loading the actual image or data sample) and would then load each sample via its index in __getitem__.
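
A minimal sketch of that idea, assuming a folder of .jpg images (the folder layout and the transform are just placeholders):

```python
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class LazyImageDataset(Dataset):
    """Map-style dataset: only the file paths live in memory, images are read on demand."""

    def __init__(self, root):
        self.paths = sorted(Path(root).glob("*.jpg"))  # cheap: just a list of paths
        self.to_tensor = transforms.ToTensor()

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        # the image is only read from disk when this particular index is requested
        image = Image.open(self.paths[index]).convert("RGB")
        return self.to_tensor(image)
```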

So does that mean the data is not loaded into memory at once? Also, if we use the map-style dataset we dont have to create a custom shuffler?

Yes, if you use the lazy loading approach described before, no data will be preloaded and only the samples needed for the current batch will be loaded into memory (times the batch size, number of workers, and prefetch factor, but these are details).
Yes, you won’t need to apply any custom shuffling, since the DataLoader will shuffle the indices if shuffle=True.
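
For example (the batch size, worker count, and prefetch factor are arbitrary values here), reusing the lazy map-style dataset sketched above:

```python
from torch.utils.data import DataLoader

dataset = LazyImageDataset("data/train")  # the lazy map-style dataset from the sketch above
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,        # DataLoader reshuffles the indices at the start of every epoch
    num_workers=4,       # worker processes loading samples in parallel
    prefetch_factor=2,   # batches each worker prepares ahead of time
    pin_memory=True,
)
```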

Amazing, thanks for the help! Does that mean I initialize the dataset and the DataLoader once before looping over the epochs?

Yes, take a look at the ImageNet training script in the PyTorch examples repository.
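
Roughly like this, with toy stand-ins for the dataset and model just to show the structure:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# toy stand-ins; swap in your real dataset and model
dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# dataset and DataLoader are created once, outside the epoch loop
loader = DataLoader(dataset, batch_size=64, shuffle=True)

for epoch in range(5):
    # with shuffle=True the DataLoader reshuffles automatically at the start of each epoch
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
```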
