I was testing the new DataPipes and I wanted to use random_split to split my datapipe into training, testing, and validation. However, I found that when I shuffle before I do the random split, I get duplicate values in my training and validation sets, which is far from ideal. I could just shuffle the datapipes after splitting them, but I was wondering why this is happening, since it is a scary mistake to end up with an overlap of training and validation data.
Thanks for the ping! @ptrblck’s description is correct. It is indeed re-shuffling before each split: random_split is evaluated lazily, so when you read a different split, it reads the source DataPipe again. In this case, it re-reads from Shuffler, whose ordering changes each time it is re-read.
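For reference, here is a minimal sketch of that failure mode, assuming torchdata's functional IterDataPipe API (IterableWrapper, shuffle, random_split); the toy source, weights, and seed are arbitrary illustration values:

```python
from torchdata.datapipes.iter import IterableWrapper

# Shuffle *before* splitting: each split re-iterates the Shuffler,
# which draws a fresh random order on every read.
source = IterableWrapper(range(10)).shuffle()
train, valid = source.random_split(
    total_length=10, weights={"train": 0.5, "valid": 0.5}, seed=0
)

# Because reading train and reading valid each trigger a new shuffle,
# the same element can be assigned to both splits.
print(set(train) & set(valid))  # often non-empty
```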
1. Shuffling before random_split should not impact the randomness of the final results, because random_split is fully random anyway, so you can simply drop the initial shuffle.
2. If you are concerned about the ordering of the final output (because random_split preserves the sequential order of the source), you can perform the shuffle after random_split instead (sketched below).
3. If the above doesn’t work for you, you can consider caching the results of the initial shuffle (i.e. dp.shuffle().in_memory_cache()) prior to random_split, but this will use some memory and materialize a partial result, rather than staying lazy like option 2.
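Here is a rough sketch of options 2 and 3, again assuming the torchdata functional API; the source, weights, and seed are placeholders:

```python
from torchdata.datapipes.iter import IterableWrapper

source = IterableWrapper(range(10))

# Option 2: split first, then shuffle each split independently.
# The splits are disjoint by construction; shuffling afterwards
# only changes their internal order.
train, valid = source.random_split(
    total_length=10, weights={"train": 0.5, "valid": 0.5}, seed=0
)
train, valid = train.shuffle(), valid.shuffle()

# Option 3: shuffle once, cache the shuffled order in memory, then split.
# The splits read the same cached ordering instead of re-triggering the
# Shuffler, at the cost of materializing the data.
cached = IterableWrapper(range(10)).shuffle().in_memory_cache()
train, valid = cached.random_split(
    total_length=10, weights={"train": 0.5, "valid": 0.5}, seed=0
)
```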
If none of these works, please let me know why and I am happy to discuss further.