Best way to incrementally load data for LSTM using DataLoader

Currently, I have 500+ pickle files that hold time-series data in the form of data frames, where each data frame represents a single day. Each of these data frames holds ~10,000 rows of data and ~500 features. I want to feed this data through an LSTM model; however, loading the entire data set and looping to create (input, output) tuples is too memory-intensive. In addition, since one pickle file does not correspond to a single LSTM input, I cannot simply load a single file in __getitem__ as a solution.

How do I create a DataLoader for this situation? Would I have to load all of the data at once because it is time-series data? Would I have to change the data format so that each file represents a single sample (which would have the issue of producing far too many files)? Any help is appreciated!

I think the first thing is to cut your 500 features down to a much smaller set. Feature selection (or dimensionality reduction) is definitely one way to look at it. What kind of dataset is this where you have 500 features? Curious to know. Is this medical-related, or are you simply trying to do a performance test? :sunglasses:
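For what it's worth, a dimensionality-reduction pass like the one suggested above could look like this (a sketch using scikit-learn's PCA; the 500-feature shape comes from the question, but the 50-component count is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for one day's frame: 10,000 rows x 500 features (from the question).
rng = np.random.default_rng(0)
day = rng.standard_normal((10_000, 500))

# Keep 50 components (arbitrary); in practice you would fit once on a
# representative sample and reuse the fitted PCA on every file.
pca = PCA(n_components=50)
reduced = pca.fit_transform(day)
print(reduced.shape)  # (10000, 50)
```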

Hi there! Have you tried creating a custom Dataset to build the DataLoader?
It's a PyTorch class; here are some examples:
It was really useful to me!
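For reference, a minimal custom Dataset is just a class with __len__ and __getitem__; the DataLoader then batches it for you (a sketch with toy in-memory tensors; the shapes are illustrative, not from the question):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SequenceDataset(Dataset):
    """Minimal custom Dataset wrapping in-memory (input, target) tensors."""
    def __init__(self, inputs, targets):
        self.inputs = inputs    # shape: (num_samples, seq_len, num_features)
        self.targets = targets  # shape: (num_samples,)

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        # Return ONE (input, target) sample; the DataLoader assembles batches.
        return self.inputs[idx], self.targets[idx]

# Toy data: 100 samples, sequence length 20, 5 features.
ds = SequenceDataset(torch.randn(100, 20, 5), torch.randn(100))
loader = DataLoader(ds, batch_size=16, shuffle=True)
x, y = next(iter(loader))
print(x.shape)  # torch.Size([16, 20, 5])
```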


Hi! Thanks for the resources. I'm still new to DataLoaders and a little confused about this specific use case. I see that the __getitem__ function is supposed to return a single sample for the model to train on, and the tutorial does this by loading one file per sample; however, that would not be practical for me, since I would end up with a couple million files if I created a file per sample. Is there an alternative way to use DataLoader where I wouldn't need millions of files?

If your data are images, try the ImageFolder class; if not, I'm sorry, I can't help. My data is all contained in a single CSV, so I read the CSV and then feed the Dataset class with the data and the labels, which gets the data prepared for the DataLoader.
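The CSV approach described above can be sketched like this (the "label" column name and the one-row-per-sample layout are assumptions for illustration):

```python
import pandas as pd
import torch
from torch.utils.data import Dataset

class CsvDataset(Dataset):
    """Reads a whole CSV into memory; every column except label_col is a feature."""
    def __init__(self, csv_path, label_col="label"):  # "label" is a hypothetical column name
        df = pd.read_csv(csv_path)
        self.labels = torch.tensor(df[label_col].to_numpy(), dtype=torch.float32)
        self.features = torch.tensor(
            df.drop(columns=[label_col]).to_numpy(), dtype=torch.float32
        )

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]
```

This works when the whole CSV fits in memory, which is exactly the assumption that does not hold in the original question.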

You don't have to manage batching yourself; the DataLoader will manage batches for you! You just have to handle the loading and transformation of one item in the Dataset, and the loader will handle the rest for the batch. That is the point of having them in the first place.
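Putting this together for the original question: one possible approach (a sketch; the WindowedPickleDataset name, the equal-rows-per-file assumption, and the window length are all illustrative) is to index sliding windows across the per-day pickle files and load each file lazily inside __getitem__, so only one day's frame sits in memory at a time:

```python
import pickle
import torch
from torch.utils.data import Dataset

class WindowedPickleDataset(Dataset):
    """Indexes sliding windows across many per-day pickle files.

    Assumes each pickle holds a DataFrame whose rows are time steps and that
    every file has the same number of rows; only one file is kept in memory.
    """
    def __init__(self, paths, seq_len, rows_per_file):
        self.paths = paths
        self.seq_len = seq_len
        # Number of (window, next-row target) pairs per file.
        self.per_file = rows_per_file - seq_len
        self._cache_path, self._cache = None, None

    def __len__(self):
        return len(self.paths) * self.per_file

    def _load(self, path):
        if path != self._cache_path:  # reload only when we cross a file boundary
            with open(path, "rb") as f:
                self._cache = pickle.load(f)
            self._cache_path = path
        return self._cache

    def __getitem__(self, idx):
        # Map the global index to (which file, window start within that file).
        file_idx, start = divmod(idx, self.per_file)
        df = self._load(self.paths[file_idx])
        window = df.iloc[start:start + self.seq_len].to_numpy()
        target = df.iloc[start + self.seq_len].to_numpy()
        return (torch.tensor(window, dtype=torch.float32),
                torch.tensor(target, dtype=torch.float32))
```

One caveat: with shuffle=True the one-file cache will thrash, since consecutive indices land in different files; shuffling only within a day, or using a sampler that groups indices by file, keeps loading cheap.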