Potential issues with a very large map-style dataset

I have a very large number of files on disk (on the order of billions of samples) for which I have an index, so I can construct a map-style dataset out of them. My question is: is this a good idea? For example, during training, will the shuffle index mapper (i.e., generating and holding a permutation over billions of indices) become a problem? I'm trying to gauge whether there will be problems with this approach, and I would appreciate any input/experience here. Thanks!
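
For concreteness, here is a minimal sketch of what I have in mind; the index format, file layout, and decoding step are placeholders, not my actual pipeline:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class IndexedFileDataset(Dataset):
    """Map-style dataset backed by a precomputed index of sample locations on disk."""

    def __init__(self, index):
        # `index` is assumed to be a sequence of (file_path, offset, length) tuples,
        # one entry per sample -- billions of entries in my case.
        self.index = index

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        path, offset, length = self.index[i]
        with open(path, "rb") as f:
            f.seek(offset)
            raw = f.read(length)
        # Decoding is dataset-specific; returning raw bytes as a tensor is just a placeholder.
        return torch.frombuffer(bytearray(raw), dtype=torch.uint8)

# With shuffle=True, the DataLoader's sampler permutes all dataset indices each epoch,
# which is the part I'm worried about at billions of samples.
# loader = DataLoader(IndexedFileDataset(my_index), batch_size=32,
#                     shuffle=True, num_workers=8)
```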