DataPipe random_split with shuffle produces splits with duplicates

I was testing the new DataPipes and wanted to use random_split to split my datapipe into training, testing, and validation sets. However, I found that when I shuffle before applying random_split, I get duplicate values across my training and validation splits, which is far from ideal. I could just shuffle the datapipes after splitting them, but I am wondering why this happens, since an overlap between training and validation data is a scary mistake to make.

from torchdata.datapipes.iter import IterableWrapper

train, test, valid = IterableWrapper(range(10)).shuffle().random_split(total_length=10, weights={"train": 0.4, "test": 0.3, "valid": 0.3}, seed=0)



train: [4, 7, 6, 5]
test:  [8, 7, 0]
valid: [5, 9, 2]

Note that 7 appears in both train and test, 5 appears in both train and valid, and 1 and 3 are missing entirely.

CC @nivek Is calling shuffle() before random_split valid? It seems the pipe is re-shuffled before each split, thus creating duplicates.

Thanks for the ping! @ptrblck’s description is correct. It is indeed re-shuffling before each split: because random_split is evaluated lazily, reading a different group reads the source DataPipe again. In this case, it re-reads from Shuffler, whose ordering changes on every re-read.
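The mechanism can be illustrated with a pure-Python sketch (this is not torchdata code; assign_groups and read_split are hypothetical stand-ins for random_split's seeded position-to-group assignment and Shuffler's re-reads):

```python
import random

def assign_groups(n, weights, seed):
    # Mimics random_split with a fixed seed: each *position* in the
    # stream is assigned to a group by a deterministic RNG draw, so the
    # position -> group mapping is identical on every read.
    rng = random.Random(seed)
    names, probs = list(weights), list(weights.values())
    return [rng.choices(names, probs)[0] for _ in range(n)]

def read_split(data, weights, seed, group, order_seed):
    # Mimics one lazy read of a split: the source is re-shuffled on
    # every read (like Shuffler), then filtered by the fixed group marks.
    order = list(data)
    random.Random(order_seed).shuffle(order)
    marks = assign_groups(len(order), weights, seed)
    return [value for value, g in zip(order, marks) if g == group]

weights = {"train": 0.4, "test": 0.3, "valid": 0.3}

# Same shuffle order for every read -> a clean partition of 0..9.
same = [read_split(range(10), weights, 0, g, order_seed=1)
        for g in weights]

# A *different* shuffle order per read (what re-reading Shuffler does):
# the same positions now hold different values, so splits can overlap
# and other values can be dropped entirely.
fresh = [read_split(range(10), weights, 0, g, order_seed=i)
         for i, g in enumerate(weights)]
```

The split sizes are stable either way (the position-to-group marks never change); only the values occupying those positions move between reads.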

@Moust_Holmes My quick thought on this is:

  1. Shuffling before random_split should not impact the randomness of the final results, because random_split is fully random anyway.
  2. If you are concerned about the ordering of the final output (because random_split preserves the sequential order of the source), you can shuffle after random_split.
  3. If the above doesn’t work for you, you can consider caching the results after the initial shuffle (i.e. dp.shuffle().in_memory_cache()) prior to random_split, but this uses some memory and materializes a partial result rather than staying fully lazy as in option 2.

If none of these works, please let me know why and I am happy to discuss further.