Best way to implement validation / train split using torchdata?

I’m using torchdata, and I was wondering what the best way was to implement a validation/train split.

Currently, I’m using fork + header to split the datapipe into two and take the first N samples for the validation dataset, then I use enumerate + filter to skip those same N samples in the training dataset, but I wonder if there’s a better way.
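For reference, the fork/header + enumerate/filter approach described above can be sketched in plain Python using itertools stand-ins for the datapipes (the sample count N and the range source are illustrative, not part of any real pipeline):

```python
from itertools import islice, tee

N = 3                # number of validation samples (illustrative)
source = range(10)   # stand-in for the source datapipe

# Like dp.fork(num_instances=2): two independent views of the same stream.
val_branch, train_branch = tee(iter(source), 2)

# Like .header(N): take the first N samples for validation.
val = list(islice(val_branch, N))

# Like .enumerate() + .filter(...): skip those same N samples for training.
train = [x for i, x in enumerate(train_branch) if i >= N]

print(val)    # -> [0, 1, 2]
print(train)  # -> [3, 4, 5, 6, 7, 8, 9]
```

This mirrors the shape of the datapipe graph: one branch truncated at N, the other filtered by index.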

Edit: I was also wondering if there is a good way to set the length of a datapipe when we know it ahead of time (for example, when we know the number of files in a directory).

This isn’t exactly what you asked, but to me the validation/train (/test?) split is more a (crucial!) process and organization topic than a pure data-loading one.

In particular, I would highly recommend defining the split ahead of time (early on) and recording it (e.g. as a set of .csv files) rather than computing it on the fly. This gives you the opportunity to easily re-use the same split, audit it for inadvertent information leaks, etc.
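As a hedged sketch of that recommendation (sample names, output file names, and the validation size are all hypothetical; in practice the identifiers would come from your actual dataset):

```python
import csv
import random

# Hypothetical sample identifiers; in practice e.g. file names in a directory.
samples = [f"sample_{i:03d}" for i in range(10)]

rng = random.Random(42)   # fixed seed so the split itself is reproducible
rng.shuffle(samples)

n_val = 2                 # illustrative validation size
split = {"val_split.csv": samples[:n_val],
         "train_split.csv": samples[n_val:]}

# Record the split on disk so the same files can be re-used and audited later.
for out_name, names in split.items():
    with open(out_name, "w", newline="") as f:
        csv.writer(f).writerows([[name] for name in names])
```

At training time, each pipeline then just reads its .csv file instead of re-deriving the split, so the split never drifts between runs.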

Best regards


One possible solution right now is to use:

train_dp, eval_dp = dp.demux(num_instances=2, classifier_fn=rand_fn)

where your classifier function uses an RNG, and you will have to reset the seed of the RNG after each epoch (or at the beginning of each one).
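A minimal sketch of such a classifier function and the seed reset, in plain Python (the seed value and split fraction are illustrative; in torchdata this function would be passed as `classifier_fn` to `demux`):

```python
import random

SEED = 0
VAL_FRACTION = 0.2   # illustrative split ratio

rng = random.Random(SEED)

def rand_fn(sample):
    # Return the demux instance index: 0 -> train branch, 1 -> eval branch.
    return 1 if rng.random() < VAL_FRACTION else 0

samples = list(range(100))
first_epoch = [rand_fn(s) for s in samples]

# Reset the RNG seed before the next epoch so every sample lands in the
# same branch again; without this, the split would change between epochs.
rng.seed(SEED)
second_epoch = [rand_fn(s) for s in samples]
assert first_epoch == second_epoch
```

The key point is that `rand_fn` is stateful through the shared RNG, which is why the seed reset between epochs is required for a stable split.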

We are currently working on a DataPipe that can do that more easily. If that doesn’t meet your use case or you have more feature requests, feel free to upvote or comment on this GitHub issue.

Also, the easiest way to set the length of a DataPipe is to use .header(length).