What are the best options to split an IterableDataset into a train and validation set?
I am using an IterableDataset because the data is stored in multiple tfrecord files, which are easier and faster to read sequentially with generators.
The options I see are:
- Split the data files (tfrecords) into training files and validation files. This requires a lot of scripting and extra disk space.
- Keep the IterableDataset the same, and define which batches should be used for validation, e.g. batches 1000-1100, 2000-2100, … are validation and the rest is training.
With this approach it is very difficult to always get the same validation set, since the IterableDataset is shuffled (with a buffer) and loaded with multiple workers.
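For context, a variant of the first option that I have considered is splitting at the file-list level instead of copying data: deterministically assign each tfrecord file to train or validation by hashing its path, then pass each sublist to its own IterableDataset. This is only a sketch (the file names and the 10% fraction are made up), but it needs no extra disk space and is stable across shuffling and worker counts:

```python
import hashlib

def is_validation_file(path: str, val_fraction: float = 0.1) -> bool:
    # Deterministic split: the same file always lands in the same
    # split, independent of shuffle buffers or DataLoader workers.
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < val_fraction * 100

# Hypothetical shard names; in practice this would be a glob over
# the real tfrecord directory.
files = [f"shard-{i:05d}.tfrecord" for i in range(20)]

train_files = [f for f in files if not is_validation_file(f)]
val_files = [f for f in files if is_validation_file(f)]
# Each list would then be fed to its own IterableDataset.
```

The downside is that the split granularity is per file, so the validation fraction is only approximate unless the shards are small and numerous.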
Neither sounds like a good solution. Has anyone solved a similar problem?
Thanks!