The only IterDataPipe that carries an in-memory object is IterableWrapper. However, I found that the following (which loads sub-tables from inside a DataFrame) works:
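Roughly, a minimal sketch of what I mean (the names `input_df`/`df` and the index values are illustrative):

```python
import pandas as pd
from torchdata.datapipes.iter import IterableWrapper

# an illustrative in-memory DataFrame with a non-unique index,
# so that .loc[idx] returns a sub-table rather than a single row
df = input_df = pd.DataFrame(
    {"feature": range(6)}, index=[0, 0, 1, 1, 2, 2]
)

# wrap the index values, then look each sub-table up from memory
dp = IterableWrapper(sorted(set(df.index))).map(lambda idx: df.loc[idx])

for subtable in dp:
    print(subtable)
```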
As you can see, the final dp carries input_df after each index has been mapped through lambda idx: df.loc[idx].
Is this too hacky?
What is the correct way to do this?
One drawback is that I don't know whether input_df is still accessible from dp once it is wrapped inside. What are the other drawbacks?
This is an interesting use case. Technically, I don't see a drawback, but it won't gain the benefit of streaming data. Since you are already holding the DataFrame in memory, you could simply iterate over it directly.
The thing is, batching, casting, and augmenting are easier with datapipes, and I would also like the DataLoader benefits of prefetching and multiprocessing. This use case matters most when using heavy augmentations; see the sketch below.
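For example, a hedged sketch of the same in-memory datapipe with batching and a DataLoader on top (`load_row` and `augment` are hypothetical stand-ins):

```python
import pandas as pd
import torch
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

input_df = pd.DataFrame({"feature": range(100)})

def load_row(idx):
    # look the row up from the in-memory DataFrame
    return input_df.loc[idx]

def augment(row):
    # stand-in for a heavy augmentation
    return torch.tensor(row["feature"], dtype=torch.float32)

dp = (
    IterableWrapper(list(input_df.index))
    .sharding_filter()  # avoid duplicating samples across workers
    .map(load_row)
    .map(augment)
    .batch(8)
)

# batch_size=None because the pipe already batches;
# num_workers > 0 gives prefetching and multiprocessing
loader = DataLoader(dp, batch_size=None, num_workers=2)
```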
My personal experience is that when using distributed training (e.g. training with pytorch-lightning), the data loading broke at some point because of the lambda functions. The following code worked instead:
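(The original snippet isn't shown here, so this is a reconstruction of the likely shape of the fix: replace the lambda with a picklable, module-level function.)

```python
import pandas as pd
from torchdata.datapipes.iter import IterableWrapper

input_df = pd.DataFrame({"feature": range(10)})

def load_row(idx):
    # top-level functions pickle by reference; lambdas do not pickle
    return input_df.loc[idx]

dp = IterableWrapper(list(input_df.index)).map(load_row)
```

A functools.partial over a module-level function would work for the same reason.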
To be honest, I didn't want to dig deep enough to understand the root cause, but it was something about multiprocessing (or multithreading) and the pickling of some functions: the standard pickle module cannot serialize lambdas, and worker processes need to pickle the functions passed to map.
Since I have solved the problem, I will close this issue, but I leave it to you to evaluate how common the "use datapipes to work with in-memory objects" use case will be, and what the "proper" way to do it is.