Hi everyone,
I have a huge dataset (we’re talking about trillions and trillions of samples here). The samples are read from multiple files and are then streamed into the dataloaders. For my implementation, I’ve used an iterable dataset and then wrapped it in a DataLoader.
From my understanding, if you use an iterable dataset, you need to take care of the shuffling yourself? And the data needs to be reshuffled in each epoch to avoid overfitting.
I’m confused about how I should initialize the dataset and DataLoader before training. Do I create the dataset and DataLoader again in each epoch, or only once before the training loop? How does the sequence go? Also, do I need a custom dataset shuffler to shuffle the data in that case?
Not necessarily, since the standard approach is to lazily load the data anyway. I.e. in a map-style dataset you would store e.g. the paths to all images (without loading the actual image or data sample) and would then load the sample via its index in __getitem__.
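To make this concrete, here is a minimal sketch of such a lazily loading map-style dataset. The root path, the `*.pt` glob pattern, and the assumption that each file holds one `(data, target)` tuple saved via torch.save are purely illustrative:

```python
import glob

import torch
from torch.utils.data import Dataset


class LazyFileDataset(Dataset):
    # Hypothetical layout: each file under `root` contains one sample
    # saved via torch.save as a (data, target) tuple.
    def __init__(self, root):
        # Only the file paths are stored here; no sample is loaded yet.
        self.paths = sorted(glob.glob(f"{root}/*.pt"))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        # The actual sample is loaded lazily, only when the DataLoader
        # requests this index.
        data, target = torch.load(self.paths[index])
        return data, target
```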
Yes, if you use the lazy loading approach described before, no data will be preloaded and only the samples needed for the current batch will be loaded into memory (times the batch size, number of workers, and prefetch factor, but these are details).
Yes, you won’t need to apply any custom shuffling since the DataLoader will shuffle the indices if shuffle=True.
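As a rough sketch of how the whole sequence could look (the dataset class reuses the example above, and the batch size, worker count, and epoch count are just placeholders): create the dataset and DataLoader once before training and iterate over the loader in each epoch; with shuffle=True the sampler reshuffles the indices at the start of every epoch.

```python
from torch.utils.data import DataLoader

# Hypothetical setup reusing the LazyFileDataset sketch from above.
dataset = LazyFileDataset("/path/to/samples")
loader = DataLoader(
    dataset,
    batch_size=64,      # samples loaded per batch
    shuffle=True,       # indices are reshuffled at the start of every epoch
    num_workers=4,      # background worker processes loading batches
    prefetch_factor=2,  # batches prefetched per worker
)

num_epochs = 10  # placeholder
for epoch in range(num_epochs):
    for data, target in loader:
        ...  # forward / backward / optimizer step
```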