I am currently building my own dataloader. The objective is to perform time series forecasting.
Note: I could not use the builtin TimeSeriesDataSet from pytorch-forecasting due to the nature of my dataset.
As an example, let’s assume I am forecasting the weather, using the following dataframe:
import numpy as np
import pandas as pd

# Toy dataset: 10 time steps, 3 features, values drawn uniformly from [0, 1)
X = pd.DataFrame(data={
    'temperature': np.random.random(10),
    'pressure': np.random.random(10),
    'humidity': np.random.random(10),
})
print(X.to_markdown())
|   | temperature | pressure | humidity |
|---|---|---|---|
| 0 | 0.501873 | 0.741631 | 0.500776 |
| 1 | 0.639229 | 0.716319 | 0.846043 |
| 2 | 0.305061 | 0.78736 | 0.2809 |
| 3 | 0.666592 | 0.241905 | 0.534717 |
| 4 | 0.29799 | 0.758383 | 0.217077 |
| 5 | 0.398248 | 0.537553 | 0.524409 |
| 6 | 0.0699319 | 0.706717 | 0.74684 |
| 7 | 0.707643 | 0.821382 | 0.29689 |
| 8 | 0.620412 | 0.788375 | 0.512174 |
| 9 | 0.0802374 | 0.804594 | 0.231062 |
I want to predict the temperature at t+1 using the features at t-7, t-6, …, t.
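To make the setup concrete, here is a minimal sketch of how such windows could be built with NumPy. The names `n_timestep`, `inputs` and `targets` are only illustrative, and the seeded random data stands in for the dataframe above; this is not my actual dataloader.

```python
import numpy as np
import pandas as pd

# Stand-in for the dataframe above (seeded so the sketch is reproducible)
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "temperature": rng.random(10),
    "pressure": rng.random(10),
    "humidity": rng.random(10),
})

n_timestep = 8  # t-7, ..., t (hypothetical name)

values = X.to_numpy(dtype=np.float32)  # (n_rows, feature_size)

# sliding_window_view appends the window axis last:
# (n_windows, feature_size, n_timestep)
windows = np.lib.stride_tricks.sliding_window_view(values, n_timestep, axis=0)

# Drop the last window (it has no t+1 target) and put time before features
inputs = windows[:-1].transpose(0, 2, 1)  # (n_samples, n_timestep, feature_size)
targets = values[n_timestep:, 0]          # temperature at t+1

print(inputs.shape, targets.shape)
```

With only 10 rows this yields 2 samples, each of shape (8, 3), which is the per-sample shape the RNN would receive.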
Now, in addition to that, let’s assume I have prior knowledge about the pressure data: I know it is relevant only for the 2 days before the prediction (I only need the pressure at times t, t-1 and t-2). Therefore, I do not want to include pressure values from before that, because they would act as noise for the model. Additionally, my dataset is rather small, which is why I want my data to be as useful as possible.
Since an RNN expects an input of shape (batch_size x n_timestep x feature_size), how should I fill the pressure values for the time steps (t-7, …, t-3)?
Should I do a simple backfill, where the pressure value at (t-7, …, t-3) is set equal to the pressure value at t-2?
Or should I zero out the values at (t-7, …, t-3)?
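For reference, here is a rough sketch of what the two options would look like on a batch of shape (batch_size x n_timestep x feature_size). The names `PRESSURE` and `n_keep`, as well as the random batch, are made up for the illustration. The extra mask channel at the end is not part of my question; it is only one possible way, under my own assumption, to let the model tell a filled zero apart from a real zero.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical batch: 2 samples, 8 time steps (t-7..t), 3 features
batch = rng.random((2, 8, 3)).astype(np.float32)

PRESSURE = 1  # column index of pressure (temperature, pressure, humidity)
n_keep = 3    # keep pressure only at t-2, t-1, t

# Option 1: backfill, copy the pressure at t-2 into t-7..t-3
backfilled = batch.copy()
backfilled[:, :-n_keep, PRESSURE] = batch[:, -n_keep, PRESSURE][:, None]

# Option 2: zero out the pressure at t-7..t-3
zeroed = batch.copy()
zeroed[:, :-n_keep, PRESSURE] = 0.0

# Optional (assumption, not from the question): add a 0/1 indicator feature
# marking the time steps where the pressure value is real
mask = np.zeros(batch.shape[:2] + (1,), dtype=batch.dtype)
mask[:, -n_keep:, 0] = 1.0
zeroed_with_mask = np.concatenate([zeroed, mask], axis=-1)

print(backfilled.shape, zeroed.shape, zeroed_with_mask.shape)
```

In both options the temperature and humidity columns are left untouched; only the pressure column changes for the first five time steps.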
Thanks in advance!