Hi, I’m trying to find a good solution for creating a Dataset class that loads batches from “chunked” parquet files (each file contains thousands of records). The entire dataset (all the parquet files) can’t fit in memory. The requirements are a flexible batch size and, at minimum, the ability to shuffle within a buffer. (I also have a copy of the dataset saved as .npy files.) Something like the sketch below is roughly what I have in mind.
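For context, here is a minimal sketch of the kind of thing I mean: a torch `IterableDataset` that streams rows from the parquet files with pyarrow and does a buffered shuffle. The class name, file paths, and buffer/batch sizes are just placeholders, not a finished implementation:

```python
import random
from pathlib import Path

import pyarrow.parquet as pq
from torch.utils.data import DataLoader, IterableDataset


class ParquetStreamDataset(IterableDataset):
    """Streams records from chunked parquet files with a shuffle buffer.

    Only one pyarrow record batch plus the shuffle buffer is held in
    memory at a time, so the full dataset never has to fit in RAM.
    """

    def __init__(self, files, buffer_size=10_000, read_batch_size=1_024):
        self.files = list(files)
        self.buffer_size = buffer_size
        self.read_batch_size = read_batch_size

    def _record_stream(self):
        files = self.files[:]
        random.shuffle(files)  # also shuffle the file order each epoch
        for path in files:
            pf = pq.ParquetFile(path)
            for batch in pf.iter_batches(batch_size=self.read_batch_size):
                yield from batch.to_pylist()  # one dict per row

    def __iter__(self):
        buffer = []
        for record in self._record_stream():
            buffer.append(record)
            if len(buffer) >= self.buffer_size:
                # pick a random element, swap it to the end, pop it (O(1))
                idx = random.randrange(len(buffer))
                buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
                yield buffer.pop()
        random.shuffle(buffer)
        yield from buffer  # drain whatever is left at the end


# Batch size stays flexible via the DataLoader ("data/" is a placeholder).
files = sorted(Path("data/").glob("*.parquet"))
loader = DataLoader(ParquetStreamDataset(files), batch_size=64)
```

(One caveat I’m aware of: with `num_workers > 0`, each worker would iterate every file, so the file list would need to be sharded per worker.)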
By the way, if you think there is a better way to store the dataset than as multiple large parquet files, I would be very interested to hear about it!
Thanks.
Cade