[DataLoader] What if entire dataset is larger than memory size?

Let’s say I have 10,000 data files of 100 MB each.
In total, the dataset is ~1 TB, which is much larger than memory.
To my understanding, in most of the DataLoader examples,
the “dataset” is assumed to be pre-allocated in memory and then fed to the DataLoader during training:

dataloader = DataLoader(dataset, batch_size=20, shuffle=True)

If the dataset is larger than memory, should I reload a portion of it and redefine the DataLoader multiple times during the training loop?

Specifically, I’m having trouble implementing a DataLoader with DistributedDataParallel,
because I don’t seem to have much freedom in DDP code (e.g. to reload the dataset from the hard disk during the parallel process).
But I also suspect my understanding of DataLoader is not deep enough.

Please give me some advice.


If you are using the Dataset class, move your loading logic into __getitem__ rather than __init__, so that each sample is fetched lazily.
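As a minimal sketch of that idea (the class and file layout here are illustrative, not from your setup): __init__ only records file paths, and each 100 MB file is read from disk on demand inside __getitem__, so only the samples of the current batch are ever resident in memory.

```python
import os
import torch
from torch.utils.data import Dataset

class LazyFileDataset(Dataset):
    """Map-style dataset that loads one file per __getitem__ call."""

    def __init__(self, data_dir):
        # Cheap: only list and sort the file paths, load no data yet.
        self.paths = sorted(
            os.path.join(data_dir, f) for f in os.listdir(data_dir)
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # The actual file is read here, lazily, when the DataLoader asks for it.
        return torch.load(self.paths[idx])
```

A dataset like this can be passed directly to `DataLoader(dataset, batch_size=20, shuffle=True)`; with `num_workers > 0` the disk reads also overlap with training.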

With a similar idea, we created a new library called torchdata to make iterable-style datasets a first-class citizen for large datasets. It composes multiple data operations to support lazy execution. Feel free to take a look and check out our repo.

Here is the doc TorchData


Thanks, your answer was definitely helpful.
Although my issue isn’t fully solved yet, I will keep experimenting with it.