Best way to implement validation / train split using torchdata?

I’m using torchdata, and I was wondering what the best way was to implement a validation/train split.

Currently, I’m using fork + header to split the datapipe into two and take the first N samples for the validation dataset, then I use enumerate + filter to skip those same N samples in the training dataset, but I wonder if there’s a better way.
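For reference, the fork/header + enumerate/filter approach described above can be sketched in plain Python using itertools stand-ins for the datapipes (the sample count N and the range source are illustrative, not part of any real pipeline):

```python
from itertools import islice, tee

N = 3                # number of validation samples (illustrative)
source = range(10)   # stand-in for the source datapipe

# Like dp.fork(num_instances=2): two independent views of the same stream.
val_branch, train_branch = tee(iter(source), 2)

# Like .header(N): take the first N samples for validation.
val = list(islice(val_branch, N))

# Like .enumerate() + .filter(...): skip those same N samples for training.
train = [x for i, x in enumerate(train_branch) if i >= N]

print(val)    # -> [0, 1, 2]
print(train)  # -> [3, 4, 5, 6, 7, 8, 9]
```

This mirrors the shape of the datapipe graph: one branch truncated at N, the other filtered by index.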

Edit: I was also wondering if there is a good way to set the length of a datapipe when we know it ahead of time (for example, when we know the number of files in a directory).

This isn’t exactly what you asked, but to me the validation/train (/test?) split is more a (crucial!) process and organization topic than a pure data-loading one.

In particular, I would highly recommend defining the split ahead of time (early on) and recording it (e.g. as a set of .csv files) rather than computing it on the fly. This gives you the opportunity to easily re-use the same split, audit it for inadvertent information leaks, etc.
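As a hedged sketch of that recommendation (sample names, output file names, and the validation size are all hypothetical; in practice the identifiers would come from your actual dataset):

```python
import csv
import random

# Hypothetical sample identifiers; in practice e.g. file names in a directory.
samples = [f"sample_{i:03d}" for i in range(10)]

rng = random.Random(42)   # fixed seed so the split itself is reproducible
rng.shuffle(samples)

n_val = 2                 # illustrative validation size
split = {"val_split.csv": samples[:n_val],
         "train_split.csv": samples[n_val:]}

# Record the split on disk so the same files can be re-used and audited later.
for out_name, names in split.items():
    with open(out_name, "w", newline="") as f:
        csv.writer(f).writerows([[name] for name in names])
```

At training time, each pipeline then just reads its .csv file instead of re-deriving the split, so the split never drifts between runs.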

Best regards


One possible solution right now is to use:

train_dp, eval_dp = dp.demux(num_instances=2, classifier_fn=rand_fn)

where your classifier function uses an RNG, and you will have to reset the seed of the RNG after each epoch (or at the beginning of each one).
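A minimal sketch of such a classifier function and the seed reset, in plain Python (the seed value and split fraction are illustrative; in torchdata this function would be passed as `classifier_fn` to `demux`):

```python
import random

SEED = 0
VAL_FRACTION = 0.2   # illustrative split ratio

rng = random.Random(SEED)

def rand_fn(sample):
    # Return the demux instance index: 0 -> train branch, 1 -> eval branch.
    return 1 if rng.random() < VAL_FRACTION else 0

samples = list(range(100))
first_epoch = [rand_fn(s) for s in samples]

# Reset the RNG seed before the next epoch so every sample lands in the
# same branch again; without this, the split would change between epochs.
rng.seed(SEED)
second_epoch = [rand_fn(s) for s in samples]
assert first_epoch == second_epoch
```

The key point is that `rand_fn` is stateful through the shared RNG, which is why the seed reset between epochs is required for a stable split.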

We are currently working on a DataPipe that can do that more easily. If that doesn’t meet your use case or you have more feature requests, feel free to upvote or comment on this GitHub issue.

Also, the easiest way to set the length of a DataPipe is to use .header(length).