How can I use an iterable dataset and dataloader for ML training?

Hi everyone,
I have a huge dataset (we’re talking about trillions and trillions of samples here). The samples are read from multiple files and then streamed into the dataloaders. For my implementation, I’ve used an iterable dataset and then wrapped it in a DataLoader.

From my understanding, if you use an iterable dataset, that means you need to take care of the shuffling yourself? And the data needs to be reshuffled in each epoch to avoid overfitting.

I’m confused about how I should initialize the dataset and dataloader before training. Do I create the dataset and dataloader again in each epoch? How does the sequence go? Also, do I need a dataset shuffler to shuffle the data in that case?
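
For reference, here’s a simplified sketch of what I have right now (the file names and the parsing logic are just placeholders):

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

class StreamingDataset(IterableDataset):
    """Streams samples from a list of files, sharding the files across workers."""

    def __init__(self, file_paths):
        self.file_paths = file_paths

    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is None:
            files = self.file_paths  # single-process loading: iterate over every file
        else:
            # split the files across workers so samples aren't yielded twice
            files = self.file_paths[worker_info.id::worker_info.num_workers]
        for path in files:
            with open(path) as f:
                for line in f:
                    yield self.parse(line)

    def parse(self, line):
        # placeholder: turn one raw text line into a tensor
        return torch.tensor([float(x) for x in line.split()])

dataset = StreamingDataset(["shard_0.txt", "shard_1.txt"])
loader = DataLoader(dataset, batch_size=32, num_workers=2)  # shuffle=True is not allowed with an IterableDataset
```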

Could you describe why you are using an iterable dataset instead of a map-style one?

From my research (I could be wrong), if the data doesn’t fit in memory, then you use an iterable-style dataset.

Not necessarily, since the standard approach is to lazily load the data anyway. I.e. in a map-style dataset you would store e.g. the paths to all images (without loading the actual image or data sample) and would then load each sample via its index in __getitem__.
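
A minimal sketch of that idea, assuming a folder of .jpg images (the folder layout and the transform are just placeholders):

```python
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class LazyImageDataset(Dataset):
    """Map-style dataset: only the file paths live in memory, images are read on demand."""

    def __init__(self, root):
        self.paths = sorted(Path(root).glob("*.jpg"))  # cheap: just a list of paths
        self.to_tensor = transforms.ToTensor()

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        # the image is only read from disk when this particular index is requested
        image = Image.open(self.paths[index]).convert("RGB")
        return self.to_tensor(image)
```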

So does that mean the data is not loaded into memory at once? Also, if we use the map-style dataset we dont have to create a custom shuffler?

Yes, if you use the lazy loading approach described before, no data will be preloaded and only the samples needed for the current batch will be loaded into memory (times the batch size, number of workers, and prefetch factor, but these are details).
Yes, you won’t need to apply any custom shuffling, since the DataLoader will shuffle the indices if shuffle=True.
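
For example (the batch size, worker count, and prefetch factor are arbitrary values here), reusing the lazy map-style dataset sketched above:

```python
from torch.utils.data import DataLoader

dataset = LazyImageDataset("data/train")  # the lazy map-style dataset from the sketch above
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,        # DataLoader reshuffles the indices at the start of every epoch
    num_workers=4,       # worker processes loading samples in parallel
    prefetch_factor=2,   # batches each worker prepares ahead of time
    pin_memory=True,
)
```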

Amazing, thanks for the help! Does that mean I initialize the dataset and the DataLoader once before looping over the epochs?

Yes, take a look at the ImageNet training script in the PyTorch examples repository.
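
Roughly like this, with toy stand-ins for the dataset and model just to show the structure:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# toy stand-ins; swap in your real dataset and model
dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# dataset and DataLoader are created once, outside the epoch loop
loader = DataLoader(dataset, batch_size=64, shuffle=True)

for epoch in range(5):
    # with shuffle=True the DataLoader reshuffles automatically at the start of each epoch
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
```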
