I am working with sequential data, and I am trying to create a custom `torch.utils.data.Dataset` that will reshape my data as necessary. The raw data is a collection of sequences (not all the same length) with readings taken at equally spaced time steps. Given `window_len` contiguous readings, I want to make a point forecast `horizon` steps into the future. I’ve written the following custom Dataset to reshape my data for training:

    import numpy as np
    import torch.utils.data


    class TimeSeriesDataset(torch.utils.data.Dataset):
        def __init__(self, Xs, ys, window_len, horizon):
            super().__init__()
            self.window_len = window_len
            self.horizon = horizon
            # Build every window and target up front and stack them across sequences.
            self.X = np.vstack([
                self.get_X(X, window_len, horizon)
                for X in Xs
            ])
            self.y = np.vstack([
                self.get_y(y, window_len, horizon)
                for y in ys
            ])

        def __len__(self):
            return len(self.X)

        def __getitem__(self, idx):
            return self.X[idx], self.y[idx]

        @staticmethod
        def get_X(values, window_len, horizon):
            # Row t of the hstack holds readings t .. t+window_len-1 side by side;
            # the slice drops rows that wrap around or have no target.
            return np.hstack([
                np.roll(values, -ii, axis=0)
                for ii in range(window_len)
            ])[:values.shape[0] - window_len - horizon + 1].reshape(-1, window_len, values.shape[1])

        @staticmethod
        def get_y(values, window_len, horizon):
            # The target for the window starting at t is the reading at t + window_len + horizon - 1.
            return values[window_len + horizon - 1:].copy()

Example usage:

    window_len = 3
    horizon = 2
    n_sequences = 10

    random = np.random.RandomState(0)
    sequences = [
        random.randint(0, 100, size=(random.randint(5, 10), 2))
        for ii in range(n_sequences)
    ]
    targets = [
        random.randint(0, 100, size=(len(sequence), 1))
        for sequence in sequences
    ]

    dataset = TimeSeriesDataset(sequences, targets, window_len, horizon)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True)
    for X_train, y_train in dataloader:
        print(X_train.shape, y_train.shape)

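To make the intended alignment between windows and targets concrete, here is a small standalone sketch on a toy sequence. The toy values and the `targets = values * 10` rule are made up purely for illustration; only the index arithmetic mirrors `get_X` and `get_y` above.

```python
import numpy as np

window_len, horizon = 3, 2
values = np.arange(7).reshape(-1, 1)  # toy readings 0..6, one feature
targets = values * 10                 # made-up targets, for illustration only

# A sequence of length T yields T - window_len - horizon + 1 windows.
n_windows = len(values) - window_len - horizon + 1

pairs = []
for start in range(n_windows):
    window = values[start:start + window_len]
    # The target sits `horizon` steps after the last reading in the window.
    target = targets[start + window_len + horizon - 1]
    pairs.append((window, target))
    print(window.ravel(), "->", target)
```

With these settings the first window covers readings 0 to 2 and is paired with the target at index 4, i.e. two steps after the last reading in the window.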
My problem is that the original data is small enough to fit into memory, but the reshaped training data is too large. So I have two questions:
- Is this the best way to go about solving the problem if the data fits into memory?
- What is the best way to modify this so that all of the sub-sequences don’t have to be created at once?
  - I’d like to still be able to use slicing on the `dataset` if possible (e.g., `dataset[5:10]`).
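
For reference, one direction I have been considering for the second question is to keep only the raw sequences and cut each window on demand. Below is a rough sketch of that idea, not a tested solution. The class name `LazyWindowDataset` and the helper `_get_one` are hypothetical, and it is written as a plain class with `__len__` and `__getitem__` so that it runs with NumPy alone; I assume it could subclass `torch.utils.data.Dataset` in the same way as the class above.

```python
import numpy as np


class LazyWindowDataset:
    """Hypothetical lazy variant: windows are sliced out in __getitem__."""

    def __init__(self, Xs, ys, window_len, horizon):
        self.Xs, self.ys = Xs, ys
        self.window_len, self.horizon = window_len, horizon
        # Number of valid windows per sequence (0 if the sequence is too short).
        counts = [max(len(X) - window_len - horizon + 1, 0) for X in Xs]
        # ends[i] = total number of windows in sequences 0..i.
        self.ends = np.cumsum(counts)

    def __len__(self):
        return int(self.ends[-1]) if len(self.ends) else 0

    def _get_one(self, idx):
        if idx < 0:
            idx += len(self)
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # First sequence whose cumulative window count exceeds idx.
        seq = int(np.searchsorted(self.ends, idx, side="right"))
        start = idx - (int(self.ends[seq - 1]) if seq > 0 else 0)
        stop = start + self.window_len
        # The X slice is a view into the raw sequence, so nothing is copied here.
        return self.Xs[seq][start:stop], self.ys[seq][stop + self.horizon - 1]

    def __getitem__(self, idx):
        if isinstance(idx, slice):
            # Only the requested windows are materialized.
            # Note: an empty slice would make np.stack raise.
            pairs = [self._get_one(i) for i in range(*idx.indices(len(self)))]
            return (np.stack([X for X, _ in pairs]),
                    np.stack([y for _, y in pairs]))
        return self._get_one(idx)


# Usage with made-up data: four sequences of lengths 5, 9, 4 and 6.
rng = np.random.RandomState(0)
Xs = [rng.randint(0, 100, size=(n, 2)) for n in (5, 9, 4, 6)]
ys = [rng.randint(0, 100, size=(len(X), 1)) for X in Xs]
lazy = LazyWindowDataset(Xs, ys, window_len=3, horizon=2)
X_batch, y_batch = lazy[5:8]
print(len(lazy), X_batch.shape, y_batch.shape)
```

The only extra state is the cumulative window count per sequence, so memory stays close to the size of the raw data. The trade-off is that slicing now loops over indices in Python instead of slicing one pre-built array.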