Best way to implement validation / train split using torchdata?

Vedant_Roy · August 9, 2022, 12:23am

I’m using torchdata, and I was wondering what the best way was to implement a validation/train split.

Currently, I’m using fork + header to split the datapipe in two and take the first N samples as the validation dataset. I then use enumerate + filter to skip those same N samples in the training dataset, but I wonder if there’s a better way.

Edit: I was also wondering if there was a good way to set the length of a data pipe, if we know its length ahead of time. (For example: maybe we know the number of files in a directory).

tom · August 9, 2022, 7:11am

This isn’t exactly what you asked, but to me, the validation/train (/test?) split is more a (crucial!) process/organizational topic than a pure data loading one.

In particular, I would highly recommend defining the split ahead of time (early on) and recording it (e.g. as a set of .csv files) rather than doing it on the fly. This gives you the opportunity to easily re-use the same split, audit the split for inadvertent information leaks, etc.
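As a minimal sketch of recording a split ahead of time, assuming each sample can be identified by a string such as a file name: the helpers `write_split` and `read_split`, the `sample_id` column and the 80/20 ratio are all hypothetical choices for illustration, not part of torchdata.

```python
import csv
import random
import tempfile
from pathlib import Path


def write_split(sample_ids, out_dir, eval_fraction=0.2, seed=0):
    """Shuffle once with a fixed seed and record the split as two CSV files."""
    ids = sorted(sample_ids)  # fixed starting order, independent of input order
    random.Random(seed).shuffle(ids)
    n_eval = int(len(ids) * eval_fraction)
    splits = {"eval": ids[:n_eval], "train": ids[n_eval:]}
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, members in splits.items():
        with open(out_dir / f"{name}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["sample_id"])  # hypothetical column name
            writer.writerows([m] for m in members)
    return splits


def read_split(path):
    """Read one recorded split back as a list of sample ids."""
    with open(path, newline="") as f:
        return [row["sample_id"] for row in csv.DictReader(f)]


# Example with made-up file names, written to a temporary directory.
files = [f"img_{i:03d}.png" for i in range(10)]
with tempfile.TemporaryDirectory() as tmp:
    write_split(files, tmp)
    train_ids = read_split(Path(tmp) / "train.csv")
    eval_ids = read_split(Path(tmp) / "eval.csv")
```

The recorded lists can then be used to build two separate datapipes, one per file, so the training pipeline never sees the evaluation samples.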

Best regards

Thomas

nivek · August 9, 2022, 9:31pm

One possible solution right now is to use:

train_dp, eval_dp = dp.demux(num_instances=2, classifier_fn=rand_fn)

where your classifier function uses an RNG to return 0 or 1 for each sample. You will have to reset the seed of the RNG after each epoch (or at the beginning of the next one) so that every epoch produces the same split.
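A minimal sketch of such a classifier, assuming the samples arrive in the same order every epoch (so any shuffling would have to happen after the split). The class name `RandomSplit`, the `reset` method and the 20% evaluation fraction are hypothetical. The demux call itself is left as a comment so that the snippet runs without torchdata installed.

```python
import random


class RandomSplit:
    """Hypothetical classifier: returns 0 (train) or 1 (eval) per sample."""

    def __init__(self, eval_fraction=0.2, seed=0):
        self.eval_fraction = eval_fraction
        self.seed = seed
        self.reset()

    def reset(self):
        # Re-create the RNG so the sequence of decisions repeats exactly.
        self.rng = random.Random(self.seed)

    def __call__(self, sample):
        # demux expects the index of the child DataPipe for this sample.
        return 1 if self.rng.random() < self.eval_fraction else 0


rand_fn = RandomSplit(eval_fraction=0.2, seed=42)

# Simulate two epochs over the same ten samples.
first_epoch = [rand_fn(i) for i in range(10)]
rand_fn.reset()  # call this before every epoch
second_epoch = [rand_fn(i) for i in range(10)]

# With torchdata this would be passed as:
# train_dp, eval_dp = dp.demux(num_instances=2, classifier_fn=rand_fn)
```

As far as I know, demux buffers elements for the child that is not currently being read, so reading one child to completion before touching the other may require raising its buffer_size argument.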

We are currently working on a DataPipe that can do that more easily. If that doesn’t meet your use case or you have more feature requests, feel free to upvote or comment on this GitHub issue.

nivek · August 9, 2022, 10:10pm

Also, the easiest way to set the length of a DataPipe is to use .header(length). Note that this also caps the pipe at its first length elements.