Splitting datapipe into train/eval

Hi, what is the most canonical way to perform a train/test split on a datapipe? I start off with a FileLister, and then would like to split it into 2 subsequent datapipes - one for train data and one for eval data. I don’t see any documentation related to that, and would like to do this in the most straightforward way possible.

You can use the Demultiplexer by doing something like this:

train_dp, eval_dp = dp.demux(num_instances=2, classifier_fn=rand_fn)

where rand_fn assigns a value of 0 or 1 to each sample, which determines whether the sample belongs to train or eval.
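As a rough idea of what rand_fn could look like, here is a minimal sketch. The names make_rand_fn, eval_fraction and seed are made up for illustration and are not part of the library; the only thing assumed about demux is that classifier_fn returns an integer index smaller than num_instances. The example runs on a plain list of fake paths so it works without a datapipe.

```python
import random

TRAIN, EVAL = 0, 1  # indices of the two outputs of demux(num_instances=2, ...)


def make_rand_fn(eval_fraction=0.2, seed=0):
    """Build a classifier that sends roughly `eval_fraction` of samples to eval.

    Hypothetical helper for illustration: it keeps its own seeded RNG so the
    sequence of decisions is reproducible for a given seed.
    """
    rng = random.Random(seed)

    def rand_fn(sample):
        # The sample itself is ignored; the decision only depends on the RNG.
        return EVAL if rng.random() < eval_fraction else TRAIN

    return rand_fn


rand_fn = make_rand_fn(eval_fraction=0.2, seed=0)

# Stand-in for the paths a FileLister would yield.
paths = [f"file_{i}.txt" for i in range(10)]
labels = [rand_fn(p) for p in paths]

# With a real datapipe this would be passed as:
#   train_dp, eval_dp = dp.demux(num_instances=2, classifier_fn=rand_fn)
```

Note that the assignment depends on the order in which samples arrive, not on the samples themselves.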

Let me know if this works for you.

I have already seen this suggestion, but my trouble with it is the following: since I’m using an iterable datapipe, I do not know up front which sample should go to which set. I guess the solution would be to preload all the file paths, split them into train/val, and then let the classifier function assign each sample to one set or the other based on that.
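To make the preloading idea concrete, a minimal sketch could look like the following. The helper split_paths, the 80/20 ratio and the fake file names are assumptions for illustration; in practice all_paths would come from listing the directory once.

```python
import random


def split_paths(paths, eval_fraction=0.2, seed=0):
    """Return (train_set, eval_set) for a fixed, reproducible split.

    Hypothetical helper: sorts first so the result does not depend on the
    order in which the file system happened to list the files.
    """
    ordered = sorted(paths)
    random.Random(seed).shuffle(ordered)
    n_eval = int(len(ordered) * eval_fraction)
    return set(ordered[n_eval:]), set(ordered[:n_eval])


# Stand-in for the preloaded file paths.
all_paths = [f"data/file_{i}.txt" for i in range(10)]
train_set, eval_set = split_paths(all_paths)


def classify(path):
    # 0 -> train, 1 -> eval; a pure lookup, so it is identical in every epoch.
    return 1 if path in eval_set else 0


# With a real datapipe this would be passed as:
#   train_dp, eval_dp = dp.demux(num_instances=2, classifier_fn=classify)
```

Because the classifier is a lookup rather than a random draw, the split does not change between epochs or with the order of the samples.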

I guess this would be the way to do this right now, since there is no straightforward implementation of a datapipeline component that does just that (a stateful splitter which would remember which samples were in which set, so the split is the same in every epoch).
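One alternative that needs neither preloading nor any remembered state is to derive the assignment from a hash of the file path. This is a different technique from the one suggested in this thread, and the function name hash_split_fn and the eval_percent parameter are made up for illustration. It uses hashlib rather than the built-in hash(), since the latter is randomized per process for strings.

```python
import hashlib


def hash_split_fn(path, eval_percent=20):
    """Map a path to 0 (train) or 1 (eval) based only on its content.

    Hypothetical helper: the same path always lands in the same set, in any
    epoch and in any process, because the hash is deterministic.
    """
    digest = hashlib.sha256(str(path).encode("utf-8")).digest()
    # Use the first 8 bytes as an integer and reduce it to a 0..99 bucket.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return 1 if bucket < eval_percent else 0


# Stand-in for the paths a FileLister would yield.
paths = [f"data/file_{i}.txt" for i in range(10)]
labels = [hash_split_fn(p) for p in paths]

# With a real datapipe this would be passed as:
#   train_dp, eval_dp = dp.demux(num_instances=2, classifier_fn=hash_split_fn)
```

The trade-off is that the eval share is only approximately eval_percent, especially for small datasets.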

If you pass in a classifier function that is random but provided with a specific seed you should get what you are describing:

a stateful splitter which would remember which samples were in which set, so the split is the same in every epoch

Does that make sense?

But wouldn’t the “inside randomness” of the function change between epochs? Would I have to reinitialize the pipeline with the same seed between epochs?

Yep, your classifier function will use an RNG, and you will have to reset the seed of that RNG at the beginning of each epoch (or after each one) so that it produces the same sequence of assignments.
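A minimal sketch of that reset, assuming the samples arrive in the same order every epoch (otherwise the same sequence of random draws would be applied to different samples). The class SeededSplitter and its reset method are hypothetical names, not something the library provides.

```python
import random


class SeededSplitter:
    """Callable classifier whose RNG can be rewound to its initial seed."""

    def __init__(self, eval_fraction=0.2, seed=0):
        self.eval_fraction = eval_fraction
        self.seed = seed
        self.rng = random.Random(seed)

    def reset(self):
        # Rewind the RNG so the next pass reproduces the same decisions.
        self.rng.seed(self.seed)

    def __call__(self, sample):
        # 0 -> train, 1 -> eval
        return 1 if self.rng.random() < self.eval_fraction else 0


splitter = SeededSplitter(eval_fraction=0.2, seed=0)
samples = [f"file_{i}.txt" for i in range(10)]  # stand-in for the datapipe

epoch_labels = []
for epoch in range(3):
    splitter.reset()  # reset at the beginning of every epoch
    epoch_labels.append([splitter(s) for s in samples])

# With a real datapipe the instance would be passed as:
#   train_dp, eval_dp = dp.demux(num_instances=2, classifier_fn=splitter)
```

If the source is shuffled before the split, this reset alone would not keep the split stable.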

I can see how this can be a bad user experience and we should build a DataPipe/function to handle this. I will track this within a GitHub issue.

Thanks for bringing this issue up! We appreciate your feedback!
