[DataLoader] What if entire dataset is larger than memory size?

Let’s say I have 10000 data files, each 100 MB in size.
In total, the dataset is ~1 TB, which is much larger than memory.
To my understanding, in most examples of DataLoader,
the “dataset” is assumed to be pre-loaded in memory and then fed to the DataLoader during training:

dataloader = DataLoader(dataset, batch_size=20, shuffle=True)

If the dataset is larger than memory, should I re-load a portion of it and re-create the DataLoader multiple times during the training loop?

Specifically, I’m having trouble implementing the DataLoader with DistributedDataParallel,
because I don’t seem to have much freedom in DDP code (e.g. to re-load the dataset from disk during the parallel process).
But I also suspect that my understanding of DataLoader is not deep enough.

Please give me some advice,
Thanks!


If you are using the Dataset class, move your loading logic into __getitem__ rather than __init__, so that each sample is fetched lazily from disk only when it is requested.
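A minimal sketch of that pattern, assuming each file was saved with torch.save and holds one sample (the file paths and LazyFileDataset name are just placeholders; if each file contains many samples, you would index into the file instead):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class LazyFileDataset(Dataset):
    """Keeps only the file paths in memory; reads each file on demand."""

    def __init__(self, file_paths):
        # store lightweight metadata only, not the data itself
        self.file_paths = file_paths

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        # the 100 MB file is read from disk only when this index is requested
        return torch.load(self.file_paths[idx])

# hypothetical paths to the 10000 data files from the question
file_paths = [f"data/part_{i:05d}.pt" for i in range(10000)]
dataset = LazyFileDataset(file_paths)
dataloader = DataLoader(dataset, batch_size=20, shuffle=True, num_workers=4)
```

This way only the samples of the current batches are in memory at any time, and with DDP you can wrap the same dataset with a DistributedSampler instead of shuffle=True.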

With a similar idea in mind, we created a new library called torchdata to make iterable-style datasets a first-class citizen for large datasets. It provides composable data operations that support lazy execution. Feel free to take a look and check out our repo.

Here is the doc TorchData
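As a rough sketch of how the datapipes compose (assuming the same torch.save file format as above; shard counts and paths are placeholders):

```python
import torch
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

# hypothetical list of file paths; nothing is loaded yet
file_paths = [f"data/part_{i:05d}.pt" for i in range(10000)]

datapipe = (
    IterableWrapper(file_paths)
    .shuffle()          # shuffle the file order lazily
    .sharding_filter()  # split the files across DataLoader workers / DDP ranks
    .map(torch.load)    # each file is read from disk only when iterated
)

dataloader = DataLoader(datapipe, batch_size=20)
```

Each operation is executed lazily as the DataLoader iterates, so the full 1 TB never has to sit in memory.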


Thanks, your answer was definitely helpful.
Although my issue isn’t fully solved yet, I will keep experimenting with it.