Chunked Data Randomization

Hi everyone. I’m trying to implement chunked data randomization in order to effectively shuffle a dataset where random access to individual samples is impractical. Is there a good/canonical way to do this with the Dataset/DataLoader classes?

I’m basically trying to do the following:

My dataset of N samples is split into M chunks (where M << N). I’d like to load K randomly selected chunks into memory and then randomly sample data from those chunks. As approximately one chunk’s worth of data is exhausted, another randomly selected chunk would be loaded, and sampling would continue from the chunks in memory. A rough sketch of this is below.
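To make this concrete, here’s a minimal sketch of the buffering I have in mind, written as an IterableDataset. Everything here is illustrative: load_chunk, ChunkShuffleDataset, chunks_in_memory, and the roughly-equal-chunk-size assumption are all placeholders of my own, not anything from the PyTorch API.

import random
from torch.utils.data import IterableDataset

class ChunkShuffleDataset(IterableDataset):
    """Streams samples drawn at random from K chunks held in memory."""

    def __init__(self, load_chunk, num_chunks, chunks_in_memory, seed=0):
        self.load_chunk = load_chunk              # callable: chunk index -> list of samples
        self.num_chunks = num_chunks              # M: total number of chunks
        self.chunks_in_memory = chunks_in_memory  # K: chunks kept in the buffer
        self.seed = seed

    def __iter__(self):
        rng = random.Random(self.seed)
        order = list(range(self.num_chunks))
        rng.shuffle(order)                        # random chunk load order
        buffer = []
        chunk_size = None
        # Pre-fill the buffer with K chunks (assumes roughly equal chunk sizes).
        while order and (chunk_size is None
                         or len(buffer) < self.chunks_in_memory * chunk_size):
            chunk = self.load_chunk(order.pop())
            chunk_size = chunk_size or len(chunk)
            buffer.extend(chunk)
        while buffer:
            # Once roughly one chunk's worth has been drained, load another chunk.
            if order and len(buffer) <= (self.chunks_in_memory - 1) * chunk_size:
                buffer.extend(self.load_chunk(order.pop()))
            # Draw a uniformly random sample from the buffer (O(1) swap-and-pop).
            i = rng.randrange(len(buffer))
            buffer[i], buffer[-1] = buffer[-1], buffer[i]
            yield buffer.pop()

Something like this could then be wrapped in a regular DataLoader (iterable-style datasets still support batching), though with num_workers > 0 each worker would need its own shard of the chunk list to avoid duplicates. But I’d rather use something built in if it exists.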

(For background, this is essentially how Deserializers work in CNTK: the data-loading code returns chunks, which are then randomized in the manner described above.)

Is there a simple way to do this? From my initial read of the documentation, it seems the only way is to implement all of this buffering logic in the Sampler.


Hi Bencherian,

I think PyTorch’s ChunkDataset does exactly what you need here. It works with a ChunkDataReader to randomly pick chunks, shuffle the examples within each chunk, and cache them in an internal buffer. The DataLoader then retrieves batch data from this buffer, and when the buffer is exhausted, ChunkDataset automatically loads more chunks.

To tailor this to your data format, all you need to do is implement your own ChunkDataReader, which knows how to read a single chunk of data given its index. Pass this class to ChunkDataset and the rest of the logic is handled by ChunkDataset.

If this is something you are interested in, you can take a look at some example code in the test: DataLoaderTest::ChunkDataSetGetBatch

Hope this helps :slight_smile:

Thanks for the reply @xzhu1900. I found that this functionality exists in the C++ API, but I’m not sure how to use it from my Python code, as there don’t seem to be any existing bindings. I was told that pybind11 can be used for this, but I’m not exactly sure how to structure it so that I can write the ChunkDataReader in Python. Are there any similar bindings in the PyTorch codebase that might serve as a good example?

Edit:
Soon after writing this I saw that there are PRs under way to create Python bindings for this. For anyone who is interested, see https://github.com/pytorch/pytorch/pull/21232 for details.

Yep, I was about to mention this PR :slightly_smiling_face:

This change includes the Python binding for ChunkDataset and modifies the Python DataLoader to accommodate it. Once it is merged, I think you will just need to write the logic that parses your chunk data (the ChunkDataReader) in C++, and the rest can be done in Python.

@thiagocrepaldi (the PR owner) should be able to answer more questions about this.

Thanks @xzhu1900!

@bencherian based on the PR you mentioned, you could write a chunk data reader entirely in Python, but only if your batch is a vector of basic types. Currently, bindings for the following basic-type readers are available in Python:
ChunkDataReaderDouble, ChunkDataReaderInt16T, ChunkDataReaderInt64T, ChunkDataReaderUint8T, ChunkDataReaderFloat, ChunkDataReaderInt32T, ChunkDataReaderInt8T

In that case, you could do this:

import torch

# A batch is an array of doubles
class PythonChunkDatareaderDouble(torch._C.data.ChunkDataReaderDouble):
    def __init__(self):
        super().__init__()  # initialize the pybind11 base class
        print('constructor')

    def read_chunk(self, index):
        # Read chunk `index` from storage and return its samples
        # (empty stub here)
        print('read_chunk {}'.format(index))
        return []

    def chunk_count(self):
        # Total number of chunks in the dataset
        print('chunk_count')
        return 0

    def reset(self, size=None):
        # Called when the dataset is reset, e.g. at the start of a new epoch
        print('reset {}'.format(size))

y = PythonChunkDatareaderDouble()
y.reset()
y.chunk_count()
y.read_chunk(0)

Can you please specify the dependency versions? I tried with Python 3.8 and torch 2.0.1 but get:

Traceback (most recent call last):
  File "chunk_reader.py", line 7, in <module>
    class PythonChunkDatareaderDouble(torch._C.data.ChunkDataReaderDouble):
AttributeError: module 'torch._C' has no attribute 'data'