Using datapipes over variables already in memory

The only IterDataPipe that carries an object is IterableWrapper. However, I found that the following (which builds sub-tables from inside a dataframe) works:

from torchdata.datapipes.iter import IterableWrapper

def build_simplest_dp(input_df, table_rows=1, batch_size=1):
    df = input_df  # short name
    dp = IterableWrapper(df.index)
    dp = dp.batch(table_rows)
    dp = dp.map(lambda idx: df.loc[idx])  # fetch one sub-table per index batch
    dp = dp.batch(batch_size).map(list)
    return dp

As you can see, the final dp carries input_df, captured by the closure of lambda idx: df.loc[idx].
Is this too hacky?
What is the correct way to do this?
One drawback is that I don’t know whether input_df is accessible from dp, even though it is held inside. What are the other drawbacks?
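On the accessibility question: the dataframe does stay reachable, because the function passed to map holds it in its closure. A plain-Python sketch of the same mechanism (no torchdata needed; `build_pipeline` and the dict stand in for `build_simplest_dp` and the dataframe):

```python
def build_pipeline(data):
    # The lambda closes over `data`, exactly like
    # `lambda idx: df.loc[idx]` closes over `df` above.
    fetch = lambda idx: data[idx]
    return fetch

fetch = build_pipeline({"a": 1, "b": 2})

# The captured object is reachable through the closure cell:
captured = fetch.__closure__[0].cell_contents
print(captured)  # {'a': 1, 'b': 2}
```

A related drawback follows directly: as long as the datapipe (or its map function) is alive, the closure keeps the dataframe from being garbage-collected.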

This is an interesting use case. Technically, I don’t see a drawback, but it won’t gain the benefit of streaming data. Since you already hold the dataframe in memory, you could simply iterate over it directly.


That is exactly what I would like to do.

The thing is, batching, casting, and augmenting are easier using datapipes. I would also like to get the DataLoader benefits of prefetching and multiprocessing. This use case matters even more when using heavy augmentations.
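For intuition, the batch → map → batch chain from the snippet above can be sketched in plain Python generators (using a list of dicts as a stand-in for the dataframe; this of course drops the DataLoader prefetching/multiprocessing, which is the real payoff):

```python
from itertools import islice

def chunks(iterable, size):
    # Consecutive fixed-size chunks, like IterDataPipe.batch
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

rows = [{"x": i} for i in range(5)]      # stand-in for the dataframe
indices = range(len(rows))               # stand-in for df.index

dp = chunks(indices, 2)                          # .batch(table_rows)
dp = ([rows[i] for i in b] for b in dp)          # .map(fetch rows)
dp = chunks(dp, 2)                               # .batch(batch_size)
batches = list(dp)
print(batches[1])  # [[{'x': 4}]]
```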

My personal experience is that with distributed training (e.g. training via pytorch-lightning), the data loading broke at some point because of the lambda functions. The following code worked instead:

from functools import partial
from torchdata.datapipes.iter import IterableWrapper

def get_batch(idx, df):
    return df.loc[idx]

def build_simplest_dp(input_df, table_rows=1, batch_size=1):
    df = input_df  # short name
    get_batch_fn = partial(get_batch, df=df)
    dp = IterableWrapper(df.index)
    dp = dp.batch(table_rows)
    dp = dp.map(get_batch_fn)  # fetch one sub-table per index batch
    dp = dp.batch(batch_size, wrapper_class=list)
    return dp

To be honest, I didn’t dig deep enough to understand the root cause, but it was something about multiprocessing (or multithreading) and the pickling of some functions.
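The likely culprit: DataLoader workers pickle the datapipe, and lambdas cannot be pickled by the standard pickle module, while a functools.partial over a picklable function can. A minimal illustration (using operator.add as the stand-in function):

```python
import pickle
from functools import partial
import operator

# A lambda cannot be pickled by the stdlib pickler...
try:
    pickle.dumps(lambda x: x + 1)
    lambda_picklable = True
except Exception:
    lambda_picklable = False

# ...but a partial over a picklable function survives a round-trip.
add_one = partial(operator.add, 1)
restored = pickle.loads(pickle.dumps(add_one))
print(lambda_picklable, restored(41))  # False 42
```

This is why swapping the lambda for a partial over a module-level function fixed the multiprocessing failure.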

As I have solved the problem, I will close this issue, but I leave it to you to evaluate how common the “use datapipes to work with in-memory data” use case will be, and what the “proper” way to do it is.

Yeah, lambda functions don’t work well with multiprocessing. We do have a mechanism to override this behavior, but you need to have dill installed.