[DataLoader] What if the entire dataset is larger than memory?

Let’s say I have 10,000 data files of 100 MB each.
In total, the dataset is ~1 TB, which is much larger than memory.
As far as I understand, in most DataLoader examples
the “dataset” is assumed to be pre-allocated in memory and then fed to the DataLoader during training:

dataloader = DataLoader(dataset, batch_size=20, shuffle=True)

If the dataset is larger than memory, should I reload a portion of the dataset and redefine the dataloader multiple times during training?

Specifically, I’m having trouble implementing a DataLoader with DistributedDataParallel,
because I don’t seem to have much freedom in DDP code (e.g. to reload the dataset from the hard disk inside the parallel processes).
But I also suspect that my understanding of DataLoader is not deep enough.

Please give me some advice.

If you are using the Dataset class, try moving your loading logic from __init__ into __getitem__ so that each sample is fetched lazily.
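
A minimal sketch of that lazy-loading idea, assuming each file holds one sample saved with torch.save. The class name LazyFileDataset and the tiny temporary files are made up for illustration; real files would be the ~100 MB ones from the question.

```python
import tempfile
from pathlib import Path

import torch
from torch.utils.data import DataLoader, Dataset


class LazyFileDataset(Dataset):
    """Hypothetical map-style dataset that keeps only file paths in memory."""

    def __init__(self, paths):
        # Only the (small) list of paths is stored here; no file is read yet.
        self.paths = list(paths)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        # The file is read from disk only when this sample is requested.
        return torch.load(self.paths[index])


# Tiny stand-in files so the sketch runs anywhere (assumption: one tensor per file).
tmp_dir = Path(tempfile.mkdtemp())
paths = []
for i in range(8):
    path = tmp_dir / f"sample_{i}.pt"
    torch.save(torch.full((4,), float(i)), path)
    paths.append(path)

dataset = LazyFileDataset(paths)
loader = DataLoader(dataset, batch_size=2, shuffle=True)

# Each batch is assembled from files loaded on demand.
batch_shapes = [tuple(batch.shape) for batch in loader]
```

With this layout the DataLoader never needs to be redefined during training. Under DistributedDataParallel, one common pattern is to pass sampler=DistributedSampler(dataset) (from torch.utils.data.distributed) instead of shuffle=True, so that each process reads a different subset of the files.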

Along the same lines, we created a new library called torchdata that makes iterable-style datasets a first-class citizen for large datasets. It provides composable data operations that execute lazily. Feel free to take a look and check out our repo.

Here is the doc: TorchData
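
For the iterable-style route, here is a rough sketch using the core torch.utils.data.IterableDataset rather than torchdata's own API, which is not shown here. It assumes each file is a shard containing several records stacked in one tensor; StreamingFileDataset and the temporary shard files are hypothetical names for illustration.

```python
import tempfile
from pathlib import Path

import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class StreamingFileDataset(IterableDataset):
    """Hypothetical iterable-style dataset that streams records shard by shard."""

    def __init__(self, paths):
        self.paths = list(paths)

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: this iterator covers every shard.
            my_paths = self.paths
        else:
            # Multi-worker loading: each worker takes a disjoint slice of shards.
            my_paths = self.paths[info.id::info.num_workers]
        for path in my_paths:
            records = torch.load(path)  # only one shard is in memory at a time
            for record in records:
                yield record


# Tiny stand-in shards (assumption: 3 records of 2 values per file).
shard_dir = Path(tempfile.mkdtemp())
shard_paths = []
for i in range(4):
    path = shard_dir / f"shard_{i}.pt"
    torch.save(torch.full((3, 2), float(i)), path)
    shard_paths.append(path)

stream = StreamingFileDataset(shard_paths)
streamed_batches = list(DataLoader(stream, batch_size=4))
```

This sketch shards only across DataLoader workers; splitting the shard list by DDP rank would need an extra step that is not shown.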


Thanks, your answer was definitely helpful.
Although my issue hasn’t been fully solved yet, I will try to play with it more.