How to use a dataset larger than memory?

The new ChunkDataset API might help you!

It works through hierarchical sampling: the dataset is split into chunks (sets of examples), and the chunk order is shuffled. Within each chunk, the examples are shuffled as well (a second layer of shuffling). The C++ or Python DataLoader then retrieves batches from an internal buffer that holds just a few chunks at a time, never the whole corpus.

To use it, all you need to do is implement your own C++ ChunkDataReader, which parses a single chunk of data. This reader is then passed to ChunkDataset, which handles all the shuffling and chunk-by-chunk loading for you.

Look at a test example at: DataLoaderTest::ChunkDataSetGetBatch

Currently there is only C++ support, but Python bindings are on the way (https://github.com/pytorch/pytorch/pull/21232), and any feedback is welcome.
