What are the best options to split an IterableDataset into a train and validation set?
I am using an IterableDataset because the data is stored in multiple tfrecord files, which are easier and faster to read sequentially with generators.
The options I see are:
- Split the data files (tfrecords) into training files and validation files. This requires a lot of scripting and extra disk space.
- Keep the IterableDataset the same, and define which batches should be used for validation, e.g. batches 1000-1100, 2000-2100, … are validation and the rest is training.
With this approach it is very difficult to always get the same validation set, since the IterableDataset is shuffled (with a buffer) and loaded with multiple workers.
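For context, a variant of the first option that I have considered is splitting at the file-list level instead of copying data: deterministically assign each tfrecord file to train or validation by hashing its path, then pass each sublist to its own IterableDataset. This is only a sketch (the file names and the 10% fraction are made up), but it needs no extra disk space and is stable across shuffling and worker counts:

```python
import hashlib

def is_validation_file(path: str, val_fraction: float = 0.1) -> bool:
    # Deterministic split: the same file always lands in the same
    # split, independent of shuffle buffers or DataLoader workers.
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < val_fraction * 100

# Hypothetical shard names; in practice this would be a glob over
# the real tfrecord directory.
files = [f"shard-{i:05d}.tfrecord" for i in range(20)]

train_files = [f for f in files if not is_validation_file(f)]
val_files = [f for f in files if is_validation_file(f)]
# Each list would then be fed to its own IterableDataset.
```

The downside is that the split granularity is per file, so the validation fraction is only approximate unless the shards are small and numerous.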
Neither sounds like a good solution. Has anyone solved a similar problem?
Thanks!