Given is a corpus of large files with multiple non-overlapping chunks (i.e. samples) per file.
The corpus is too large to be pre-loaded completely.
Since the DataLoader relies on lazy loading, each time a sample is requested the entire file is loaded, the corresponding chunk is extracted, and the file is discarded. To remedy this inefficiency, the file's signal should be cached.
But how can caching (i.e. sharing state) be integrated seamlessly with DataLoader, or with multiprocessing in general?
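To make the inefficiency concrete, here is a minimal sketch of a map-style dataset that caches whole files so consecutive chunk requests on the same file avoid reloading it. The class name, `load` helper, and chunk layout are illustrative assumptions, not an existing API:

```python
import functools

class CachedChunkDataset:
    """Sketch: cache whole files so chunk lookups reuse the loaded signal."""

    def __init__(self, file_paths, chunks_per_file, cache_size=4):
        self.file_paths = file_paths
        self.chunks_per_file = chunks_per_file
        # LRU cache keeps the signals of the most recently used files.
        self._load = functools.lru_cache(maxsize=cache_size)(self._load_file)

    def _load_file(self, path):
        # Stand-in for the expensive full-file load (e.g. an audio signal).
        return [f"{path}:chunk{i}" for i in range(self.chunks_per_file)]

    def __len__(self):
        return len(self.file_paths) * self.chunks_per_file

    def __getitem__(self, idx):
        file_idx, chunk_idx = divmod(idx, self.chunks_per_file)
        signal = self._load(self.file_paths[file_idx])  # hit after first access
        return signal[chunk_idx]
```

The catch is exactly the question above: with `num_workers > 0`, each DataLoader worker process holds its own copy of this cache, so nothing is actually shared.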
Attempts
- Chunking offline: not desired
- Pre-loading the corpus partially, splitting each epoch into sub-epochs:
  - not nice
  - reduces randomness
- Sharing memory:
  - loading into the cache requires writing to shared memory (slow)
- Splitting the corpus, using DistributedSampler and DistributedDataParallel with 0 workers:
  - avoids shared memory
  - sampling cures the loss of randomness
Is there another more convenient way of handling this?
It works through hierarchical sampling: the dataset is split into chunks (sets of examples), and the chunk order is shuffled. Within each chunk, the samples are shuffled too (a second layer of shuffling). The C++ or Python DataLoader then retrieves batches from an internal buffer that holds just a few chunks, not the whole corpus.
To use it, all you need to do is implement your own C++ ChunkDataReader, which parses a single chunk of data. This reader is then passed to ChunkDataset, which handles all the shuffling and chunk-by-chunk loading for you.
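The two-level scheme can be sketched in plain Python as a toy analogue of the ChunkDataReader/ChunkDataset pair. The `read_chunk` function and all names here are illustrative assumptions, not the library's API:

```python
import random

def read_chunk(chunk_index):
    # Toy stand-in for a chunk reader: returns the samples of one chunk.
    return [(chunk_index, i) for i in range(4)]

def chunk_dataset(num_chunks, buffer_chunks=2, seed=0):
    """Yield samples with chunk-level and in-buffer shuffling."""
    rng = random.Random(seed)
    order = list(range(num_chunks))
    rng.shuffle(order)                    # first layer: shuffle chunk order
    for start in range(0, num_chunks, buffer_chunks):
        buffer = []
        for c in order[start:start + buffer_chunks]:
            buffer.extend(read_chunk(c))  # load only a few chunks at a time
        rng.shuffle(buffer)               # second layer: shuffle buffered samples
        yield from buffer
```

Only `buffer_chunks` chunks are ever resident at once, yet every sample is visited exactly once per epoch in a randomized order.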
Look at a test example at: DataLoaderTest::ChunkDataSetGetBatch
Let's say we have a dataset of N chunks, each holding M samples (N * M samples in total).
When buffering L chunks, I assume the ChunkDataset has access to all L * M samples of the current buffer? After batches are drawn from the buffer, is the memory of the corresponding samples freed, no matter which chunk they originate from? And as soon as there is enough free memory, is the next chunk loaded into the buffer?
ChunkDataset caches a few random chunks (n < N) in memory and creates minibatches out of them. The internal cache is continuously replenished in the background, so that data is always available to the user without loading the whole corpus into memory.
This is useful in a distributed computing scenario, where each worker loads only part of the dataset.
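For the distributed case, the idea is that chunk indices get partitioned across workers before any shuffling, so each worker touches a disjoint part of the corpus. A minimal sketch of such a partition (the function name and round-robin scheme are assumptions for illustration, akin to what DistributedSampler does at the sample level):

```python
def chunks_for_rank(num_chunks, rank, world_size):
    # Round-robin partition of chunk indices: each worker loads a disjoint
    # subset of the corpus, and together the workers cover all chunks.
    return list(range(rank, num_chunks, world_size))
```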