How to create hierarchical datasets/dataloaders?

Hi, I am asking here because it seemed like the right place; if it isn’t, please tell me where to ask.

Consider a stream of tabular data.

import pandas as pd
import numpy as np


def data_stream():
    for _ in range(1000):
        df = pd.DataFrame({
            'a': np.arange(10000),
            'b': (np.arange(10000) + 10000)
        })
        yield df

Please assume the dataframes will be large (and different).

I want to create a dataloader for data that is arranged as I stated above.
Batches should consist of X rows of the current dataframe until it is exhausted (with the usual flexibility, e.g. shuffling, etc.); the last batch can be thrown away if it is not full.
Then move on to the next dataframe, and so on until StopIteration.

If it were a single dataframe, I would simply use the good old torch.utils.data.Dataset with a standard DataLoader, with a small bit of configuration for the number of df rows per sample, and be done.
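
For concreteness, a minimal sketch of that single-dataframe setup (the SingleFrameDataset name and the batch size of 64 are just placeholders): one row per sample, with the DataLoader providing shuffling and dropping the incomplete last batch.

import torch
from torch.utils.data import Dataset, DataLoader

class SingleFrameDataset(Dataset):
    # Map-style dataset over a single dataframe: one row per sample.
    def __init__(self, df):
        self.rows = torch.as_tensor(df.to_numpy(), dtype=torch.float32)

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        return self.rows[idx]

df = next(data_stream())  # one dataframe from the stream above
# X rows per batch, shuffled, with the incomplete last batch dropped.
loader = DataLoader(SingleFrameDataset(df), batch_size=64, shuffle=True, drop_last=True)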

If it were a stream with a single sample per stream item, I would use torch.utils.data.IterableDataset exactly as the docs describe.
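
Likewise, a minimal sketch of that pure-stream setup, assuming a hypothetical row_stream generator that flattens the dataframes into one sample per item:

from torch.utils.data import IterableDataset, DataLoader

class SampleStream(IterableDataset):
    # Iterable-style dataset: every item coming out of the generator is one sample.
    def __init__(self, generator_fn):
        self.generator_fn = generator_fn

    def __iter__(self):
        yield from self.generator_fn()

def row_stream():
    # Hypothetical flattened stream that yields one row (one sample) at a time.
    for df in data_stream():
        yield from df.to_numpy()

# The DataLoader collates whatever the iterator yields into batches.
loader = DataLoader(SampleStream(row_stream), batch_size=64)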

However, I have both.

If I use a torch.utils.data.IterableDataset, I have to define a DataLoader for it, and then I lose the DataLoader features (shuffling, X rows per batch, drop_last) that should operate on each df itself. The same problem arises in the other direction.
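
To make the kind of nesting I am after concrete, here is a rough, untested sketch that wraps the stream in an outer IterableDataset and builds a fresh inner DataLoader per dataframe (it reuses the SingleFrameDataset sketch from above; the class name and the batch_size=None pass-through on the outer loader are just one way to wire it):

from torch.utils.data import IterableDataset, DataLoader

class NestedFrameStream(IterableDataset):
    # Outer iterable dataset: walks the stream of dataframes and, for each one,
    # builds an inner map-style DataLoader so shuffling and drop_last still
    # apply to the rows of that dataframe.
    def __init__(self, stream_fn, rows_per_batch):
        self.stream_fn = stream_fn
        self.rows_per_batch = rows_per_batch

    def __iter__(self):
        for df in self.stream_fn():
            inner = DataLoader(
                SingleFrameDataset(df),  # the map-style dataset from the first sketch
                batch_size=self.rows_per_batch,
                shuffle=True,
                drop_last=True,
            )
            yield from inner  # each yielded item is already a (rows_per_batch, n_cols) batch

# batch_size=None disables the outer loader's automatic batching,
# so the inner batches pass through unchanged.
outer = DataLoader(NestedFrameStream(data_stream, rows_per_batch=64), batch_size=None)

One caveat with this kind of sketch: if the outer DataLoader uses num_workers > 0, every worker iterates the full stream unless the stream is sharded per worker (the usual IterableDataset duplication issue).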

What’s the accepted etiquette here regarding bumping posts?

You might find the recent discussions on composable datasets useful or interesting.
