I have a large dataset, split into 2000 files, each containing an hour of 256 Hz time-series data (too large to fit in memory), and my model takes 30 seconds of this data as input. I want to sample these 30-second windows randomly from the data. I worried that a map-style dataset would be too slow, since it would have to load an entire file just to get one 30-second window from it, and that turned out to be true. So I wrote a custom batch sampler and a custom dataset whose getitem calls the pandas read_csv function with the skiprows and nrows parameters to try to load only what I need, but it is still incredibly slow: even with batching and multiple workers, read_csv still has to parse past all the skipped lines on every sample, and each file has ~900k rows.
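Concretely, this is roughly what my current approach looks like (simplified; the class name, file list, and the tiny synthetic CSV at the bottom are made up for illustration):

```python
import os
import tempfile

import numpy as np
import pandas as pd

SAMPLE_RATE = 256          # Hz
WINDOW = 30 * SAMPLE_RATE  # 7680 rows per 30 s window


class RandomWindowDataset:
    """Map-style dataset (PyTorch only needs __len__/__getitem__).

    `files` and `rows_per_file` stand in for my real file list; each
    real file holds ~1 hour of 256 Hz data (~900k rows).
    """

    def __init__(self, files, rows_per_file, samples_per_epoch, seed=0):
        self.files = files
        self.rows_per_file = rows_per_file
        self.samples_per_epoch = samples_per_epoch
        self.rng = np.random.default_rng(seed)

    def __len__(self):
        return self.samples_per_epoch

    def __getitem__(self, idx):
        path = self.files[self.rng.integers(len(self.files))]
        start = int(self.rng.integers(self.rows_per_file - WINDOW))
        # The bottleneck: skiprows does not seek, so pandas still parses
        # the file from the top on every single sample.
        df = pd.read_csv(path, skiprows=start, nrows=WINDOW, header=None)
        return df.values.astype(np.float32)


# Tiny demo with one short synthetic file (2 windows long)
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
rows = 2 * WINDOW
pd.DataFrame(np.random.randn(rows, 4)).to_csv(path, index=False, header=False)
ds = RandomWindowDataset([path], rows_per_file=rows, samples_per_epoch=100)
window = ds[0]
print(window.shape)  # (7680, 4)
```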
First, if I just saved a standard dataset of predefined windows, rather than taking random intervals from the data, and used a PyTorch RandomSampler on it, would that be an efficient solution? I don't really know how the stock dataset classes work internally, or whether they would sample efficiently from a dataset of this size. (I would be fine with this approach as long as the sampling is well distributed, works quickly, and doesn't require storing the data many times over.)
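To make that question concrete, this is the kind of layout I have in mind (file naming and sizes are hypothetical): each predefined window saved as its own small .npy file, so getitem is one small binary read, and DataLoader with shuffle=True (i.e. a RandomSampler) would handle the random order.

```python
import tempfile
from pathlib import Path

import numpy as np

SAMPLE_RATE = 256
WINDOW = 30 * SAMPLE_RATE  # rows per precomputed window


class PrecomputedWindowDataset:
    """Map-style dataset over windows saved one-per-file as .npy.

    Each item is a single small binary read: no CSV parsing, no
    skiprows scan. DataLoader(ds, shuffle=True) would then draw
    windows uniformly at random.
    """

    def __init__(self, window_files):
        self.window_files = window_files

    def __len__(self):
        return len(self.window_files)

    def __getitem__(self, idx):
        return np.load(self.window_files[idx])


# Demo: pre-slice a fake (shortened) hour into non-overlapping windows
tmp = Path(tempfile.mkdtemp())
hour = np.random.randn(8 * WINDOW, 4).astype(np.float32)
paths = []
for i in range(len(hour) // WINDOW):
    p = tmp / f"win_{i:06d}.npy"
    np.save(p, hour[i * WINDOW:(i + 1) * WINDOW])
    paths.append(p)

ds = PrecomputedWindowDataset(paths)
print(len(ds), ds[0].shape)  # 8 (7680, 4)
```

With non-overlapping windows this stores the data only once, which is what I mean by not storing it many times over.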
I’m open to any suggestions about how to approach this. Thanks.