Caching and sharing long chunked files

Problem Statement

We have a corpus of large files, each containing multiple non-overlapping chunks (i.e. samples).
The corpus is too large to be pre-loaded completely.

Since DataLoader relies on lazy loading, each time a sample is requested the entire file is loaded, the corresponding chunk is extracted, and the file is then discarded. To remedy this inefficiency, the file’s signal should be cached.

But how can caching (i.e. sharing state) be integrated seamlessly with DataLoader, or with multiprocessing in general?

Attempts

  1. Chunking offline: not desired
  2. Pre-loading the corpus partially and splitting each epoch into sub-epochs
  • not elegant
  • reduces randomness
  3. Sharing memory
  • the loader needs to write into shared memory to fill the cache (slow)
  4. Splitting the corpus, using DistributedSampler and DistributedDataParallel with 0 workers
  • avoids shared memory
  • sampling compensates for the loss of randomness

Is there another, more convenient way of handling this?

The new ChunkDataset API might help you!

It works through hierarchical sampling: the dataset is split into chunks (sets of examples), and the order of the chunks is shuffled. Within each chunk, the samples are shuffled as well (a second layer of shuffling). The C++ or Python DataLoader then retrieves batches from an internal buffer that holds just a few chunks, not the whole corpus.

In order to use it, all you need to do is implement your own C++ ChunkDataReader, which parses a single chunk of data. This chunk reader is then passed to ChunkDataset, which handles all the shuffling and chunk-by-chunk loading for you.
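Here is a rough sketch of what that could look like, modeled loosely on the DummyChunkDataReader used in the C++ tests. The class name MyChunkReader, the integer samples, the chunk counts, and the sampler/option values are all illustrative assumptions, and the exact signatures may differ between versions:

```cpp
#include <torch/torch.h>

#include <numeric>
#include <vector>

// Illustrative reader: each "chunk" is a block of 100 integer samples.
// In a real corpus, read_chunk() would open one file, parse it, and
// return all samples it contains.
class MyChunkReader : public torch::data::datasets::ChunkDataReader<int> {
 public:
  using BatchType = torch::data::datasets::ChunkDataReader<int>::ChunkType;

  // Load and parse a single chunk, returning all of its samples.
  BatchType read_chunk(size_t chunk_index) override {
    constexpr size_t kSamplesPerChunk = 100;
    BatchType chunk(kSamplesPerChunk);
    std::iota(
        chunk.begin(),
        chunk.end(),
        static_cast<int>(chunk_index * kSamplesPerChunk));
    return chunk;
  }

  // Total number of chunks in the corpus.
  size_t chunk_count() override {
    return 10;
  }

  // Clear any internal state at the start of a new epoch.
  void reset() override {}
};

int main() {
  using namespace torch::data;

  MyChunkReader reader;
  // One sampler shuffles the chunk order, the other shuffles the samples
  // inside the preloaded chunks (the two layers of shuffling). The sizes
  // passed here are placeholders; ChunkDataset resets the samplers itself.
  samplers::RandomSampler chunk_sampler(0);
  samplers::RandomSampler example_sampler(0);

  auto dataset = datasets::make_shared_dataset<datasets::ChunkDataset<
      MyChunkReader,
      samplers::RandomSampler,
      samplers::RandomSampler>>(
      reader,
      chunk_sampler,
      example_sampler,
      datasets::ChunkDatasetOptions(
          /*preloader_count=*/2, /*batch_size=*/32));

  auto data_loader =
      make_data_loader(dataset, DataLoaderOptions(32).workers(0));

  for (auto& batch : *data_loader) {
    // `batch` is a std::vector<int> of up to 32 shuffled samples drawn
    // from the few chunks currently held in the internal buffer.
    (void)batch;
  }
}
```

The two RandomSampler instances correspond to the two layers of shuffling described above: one decides the order in which chunks are read, the other shuffles the samples inside the buffered chunks.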

Look at a test example at: DataLoaderTest::ChunkDataSetGetBatch

Currently there is only C++ support, but Python bindings are on the way (https://github.com/pytorch/pytorch/pull/21232), and any feedback is welcome.

Very interesting!

Let’s say we have a dataset of N chunks, each of which has M samples, so N * M samples in total.
When buffering L chunks, I assume the ChunkDataset has access to all L * M samples of the current buffer? After batches are drawn from the buffer, the memory of the corresponding samples is freed, no matter which chunk they originate from? And as soon as there is enough free memory, the next chunk is loaded into the buffer?

ChunkDataset will cache a few random chunks (n < N) in memory and create mini-batches out of them. The internal cache is continuously replenished in the background, so that there is always data available for the user without loading the whole corpus into memory.
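The cache behaviour is tuned through ChunkDatasetOptions; the parameter names below (preloader_count, batch_size, cache_size) are illustrative and may differ between versions:

```cpp
// Sketch (drop-in for the options in the earlier example): keep only a
// small number of examples resident while background preloader threads
// keep the cache replenished.
auto options = torch::data::datasets::ChunkDatasetOptions(
    /*preloader_count=*/4,  // background threads filling the cache
    /*batch_size=*/32,      // size of the mini-batches handed to the loader
    /*cache_size=*/256);    // examples kept in memory, far fewer than N * M
```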

This is useful in a distributed computing scenario, where each worker will load only part of the dataset.
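One way to set that up, sketched below purely as an assumption (not something tested in this thread), is to use a distributed sampler such as samplers::DistributedRandomSampler for the chunk order, so each rank only ever reads its own share of chunks:

```cpp
// Assumed usage: shuffle chunks with a distributed sampler so that each of
// the `num_replicas` workers loads a disjoint subset of chunks.
torch::data::samplers::DistributedRandomSampler chunk_sampler(
    /*size=*/0,          // placeholder; reset by ChunkDataset to chunk_count()
    /*num_replicas=*/4,  // total number of workers in the job
    /*rank=*/0);         // this worker's rank
torch::data::samplers::RandomSampler example_sampler(0);
// These would replace the two RandomSampler instances in the earlier sketch.
```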

Sorry for the delay, I didn’t get notified of your reply.