I have a large multivariate time-series dataset in a .parquet file that I want to use for forecasting with the pytorch-forecasting library. The problem I am having is that the library assumes your dataset fits into memory as a pandas DataFrame for data loading. The dataset consists of a `group_idx`, a `time_idx`, and several covariates. The `group_idx` identifies an individual time series, and the `time_idx` is a sequential index of time steps within it.
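For concreteness, the long-format layout I mean looks roughly like this (the rows and the `value` column are made up for illustration):

```python
from itertools import groupby
from operator import itemgetter

# Toy rows in the long format described above: one row per (group, time step).
# 'value' stands in for the target and the covariates.
rows = [
    {"group_idx": "A", "time_idx": 0, "value": 1.0},
    {"group_idx": "A", "time_idx": 1, "value": 1.5},
    {"group_idx": "B", "time_idx": 0, "value": 9.0},
    {"group_idx": "B", "time_idx": 1, "value": 8.5},
    {"group_idx": "B", "time_idx": 2, "value": 8.0},
]

# Each group_idx is one series; time_idx runs sequentially within it.
series = {
    key: [r["time_idx"] for r in grp]
    for key, grp in groupby(rows, key=itemgetter("group_idx"))
}
print(series)  # {'A': [0, 1], 'B': [0, 1, 2]}
```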
To tackle the memory problem, I have looked at TorchData and DataPipes, but the only example I found is a simple one about univariate time-series data from a small CSV file. I am hoping to achieve the following:
- load data without running into memory issues
- normalize continuous variables and encode categorical variables numerically (e.g. target normalization, integer encoding of categories)
- extract, from a larger individual time series, subsequences that respect the encoder and decoder lengths
- possibly handle missing time steps within a subsequence with a simple imputation scheme
- create a training/validation split that respects the time horizon and ensures no information leaks along the temporal dimension
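To make the windowing and split points concrete, here is a minimal sketch (plain Python; all names are mine, not library API) that extracts consecutive encoder+decoder windows from one group's `time_idx` sequence and splits them at a time cutoff, so that no training window's decoder overlaps the validation period:

```python
def make_windows(time_idx, encoder_length, decoder_length):
    """Yield (start_time, end_time) for every fully consecutive window."""
    total = encoder_length + decoder_length
    for i in range(len(time_idx) - total + 1):
        window = time_idx[i:i + total]
        # Drop windows containing missing time steps; alternatively this is
        # where a simple imputation scheme could fill the gaps instead.
        if window[-1] - window[0] == total - 1:
            yield (window[0], window[-1])

def temporal_split(windows, decoder_length, cutoff):
    """Train: decoder ends at/before cutoff. Val: decoder starts after it.
    Windows straddling the cutoff are dropped to avoid temporal leakage."""
    train = [w for w in windows if w[1] <= cutoff]
    val = [w for w in windows if w[1] - decoder_length + 1 > cutoff]
    return train, val

# Example: a series with a missing step at time 4, encoder=2, decoder=1.
windows = list(make_windows([0, 1, 2, 3, 5, 6, 7, 8, 9], 2, 1))
print(windows)                        # [(0, 2), (1, 3), (5, 7), (6, 8), (7, 9)]
print(temporal_split(windows, 1, 7))  # ([(0, 2), (1, 3), (5, 7)], [(6, 8), (7, 9)])
```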
Essentially, I am trying to build a time-series data pipeline with capabilities similar to the pytorch-forecasting `TimeSeriesDataSet` class, but one that supports large datasets that do not fit into memory.
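For the normalization goal, one memory-friendly option I have been considering (an assumption on my part, not anything pytorch-forecasting provides for out-of-core data) is two streaming passes: the first accumulates per-group target statistics and a category vocabulary, the second applies z-scoring and integer encoding row by row. The column names here are from my dataset:

```python
import math

def fit_stats(rows):
    """Pass 1: per-group running sums for the target, plus a category vocab."""
    sums, vocab = {}, {}
    for r in rows:
        g = r["group_idx"]
        n, s, sq = sums.get(g, (0, 0.0, 0.0))
        sums[g] = (n + 1, s + r["target"], sq + r["target"] ** 2)
        vocab.setdefault(r["category"], len(vocab))
    stats = {}
    for g, (n, s, sq) in sums.items():
        mean = s / n
        var = max(sq / n - mean ** 2, 0.0)
        stats[g] = (mean, math.sqrt(var) or 1.0)  # std=1 for constant series
    return stats, vocab

def transform(rows, stats, vocab):
    """Pass 2: apply per-group z-scoring and integer category encoding."""
    for r in rows:
        mean, std = stats[r["group_idx"]]
        yield {**r,
               "target": (r["target"] - mean) / std,
               "category": vocab[r["category"]]}
```

Both passes only ever look at one row at a time, so memory stays bounded by the number of groups and categories, not the number of rows.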
Looking at TorchData, a starting point for me is the `ParquetDataFrameLoader`, but I am not sure how to use it in conjunction with the other necessary steps, for example querying only an individual time series to form a sample that can be loaded in batches.
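One idea I had, sketched below with plain generators standing in for DataPipes: if the parquet file is sorted by `group_idx` (an assumption, not guaranteed in general), then a single streaming pass can assemble one complete series at a time, keeping only the current group in memory. In TorchData this grouping stage would be a custom `IterDataPipe` placed after a record-batch-reading stage (e.g. one wrapping `pyarrow.parquet.ParquetFile.iter_batches`):

```python
def group_series(rows):
    """rows: iterable of dicts with a 'group_idx' key, sorted by group_idx.
    Yields (group_idx, list_of_rows) one complete group at a time, so memory
    is bounded by the longest single series rather than the whole dataset."""
    current_key, buffer = None, []
    for row in rows:
        if row["group_idx"] != current_key:
            if buffer:
                yield current_key, buffer
            current_key, buffer = row["group_idx"], []
        buffer.append(row)
    if buffer:
        yield current_key, buffer
```

Downstream stages (windowing, normalization, batching) could then consume these per-series chunks instead of whole-dataset DataFrames.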
I have thought about a more traditional `Dataset` approach, where I would save each individual time series separately and build a file index so I can use the "standard" PyTorch data-loading machinery, but there are more than 200k groups in my dataset, so saving them all individually is inconvenient. I am therefore wondering how to do this with TorchData, what the individual pieces should be, and whether it is even possible currently. If not, do you have suggestions for other approaches or libraries that could tackle this issue? Thank you in advance.
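As a variation on that idea (hedged: `read_window` below is a hypothetical hook, not a real API), a map-style dataset over a precomputed sample index would avoid the 200k files: keep the data in a few sorted parquet files, scan them once to record where each (group, window start) sample lives, and read only those rows on demand. The class follows the `torch.utils.data.Dataset` protocol (`__len__`/`__getitem__`) without importing torch:

```python
class WindowIndexDataset:
    """Map-style dataset over a precomputed list of sample keys.

    sample_index: list of (group_idx, window_start) keys, built in one scan
                  of the sorted parquet files.
    read_window:  hypothetical callable fetching one window's rows, e.g.
                  backed by pyarrow row-group reads keyed by the index entry.
    """
    def __init__(self, sample_index, read_window):
        self.sample_index = sample_index
        self.read_window = read_window

    def __len__(self):
        return len(self.sample_index)

    def __getitem__(self, i):
        return self.read_window(self.sample_index[i])

# Toy usage with an in-memory dict standing in for the parquet reads.
store = {("A", 0): [1.0, 1.5, 2.0], ("B", 3): [9.0, 8.5, 8.0]}
ds = WindowIndexDataset(list(store), store.__getitem__)
print(len(ds), ds[0])  # 2 [1.0, 1.5, 2.0]
```

The index itself is small (one tuple per sample), so only the windows actually requested by the DataLoader ever touch memory.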