Datasets and DataLoaders for Sequential Data

I am working with sequential data and am trying to create a custom torch.utils.data.Dataset that reshapes it as needed. The raw data is a collection of sequences (not all the same length) with readings taken at equally spaced time steps. Given window_len contiguous readings, I want to make a point forecast horizon steps into the future. I’ve written the following custom Dataset to reshape my data for training:

import numpy as np
import torch.utils.data

class TimeSeriesDataset(torch.utils.data.Dataset):
    def __init__(self, Xs, ys, window_len, horizon):
        super().__init__()
        self.window_len = window_len
        self.horizon = horizon
        # pre-compute every (window, target) pair up front and stack them
        # across sequences; this is where the memory blow-up happens
        self.X = np.vstack([
            self.get_X(X, window_len, horizon)
            for X in Xs
        ])
        self.y = np.vstack([
            self.get_y(y, window_len, horizon)
            for y in ys
        ])

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

    @staticmethod
    def get_X(values, window_len, offset):
        # stack shifted copies of the sequence so that row i holds the
        # window_len readings starting at i, then drop trailing rows whose
        # window or target would run past the end of the sequence
        n_windows = values.shape[0] - window_len - offset + 1
        windows = np.hstack([
            np.roll(values, -ii, axis=0)
            for ii in range(window_len)
        ])
        return windows[:n_windows].reshape(-1, window_len, values.shape[1])

    @staticmethod
    def get_y(values, window_len, horizon):
        # the target for window i is the reading horizon steps after the
        # last reading in that window
        return values[window_len+horizon-1:].copy()
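
As a quick sanity check (not part of the class, just how I convinced myself the indices line up), a toy sequence shows that row i of get_X is the window starting at reading i, and row i of get_y is the reading horizon steps after that window ends:

values = np.arange(12).reshape(6, 2)  # 6 readings, 2 features
X = TimeSeriesDataset.get_X(values, window_len=3, offset=2)
y = TimeSeriesDataset.get_y(values, window_len=3, horizon=2)
print(X.shape, y.shape)  # (2, 3, 2) (2, 2)
print(X[0])              # readings 0, 1, 2
print(y[0])              # reading 4, i.e. horizon=2 steps after reading 2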

Example usage:

window_len = 3
horizon = 2
n_sequences = 10
random = np.random.RandomState(0)

sequences = [
    random.randint(0, 100, size=(random.randint(5, 10), 2))
    for ii in range(n_sequences)
]
targets = [
    random.randint(0, 100, size=(len(sequence), 1))
    for sequence in sequences
]

dataset = TimeSeriesDataset(sequences, targets, window_len, horizon)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True)
for X_train, y_train in dataloader:
    print(X_train.shape, y_train.shape)
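
If I’m reading the shapes correctly, each full batch prints torch.Size([10, 3, 2]) torch.Size([10, 1]) (batch, window_len, features and batch, 1), with a smaller final batch when the total number of windows isn’t a multiple of the batch size.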

My problem is that while the original data is small enough to fit into memory, the reshaped training data is too large. So I have two questions:

  1. Is this the best way to go about it in cases where the reshaped data does fit into memory?
  2. What is the best way to modify this so that all of the sub-sequences don’t have to be created at once? (A rough sketch of the kind of lazy indexing I have in mind follows below.)
  • I’d still like to be able to use slicing on the dataset if possible (e.g., dataset[5:10]).
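
For concreteness, here is a rough, untested sketch of what I mean by not creating the sub-sequences up front (I’m not sure it’s the right direction): it keeps the raw sequences and maps a flat index to a (sequence, start) pair inside __getitem__, slicing out the window on demand.

class LazyTimeSeriesDataset(torch.utils.data.Dataset):
    def __init__(self, Xs, ys, window_len, horizon):
        super().__init__()
        self.Xs = Xs
        self.ys = ys
        self.window_len = window_len
        self.horizon = horizon
        # number of valid windows contributed by each sequence
        # (sequences too short to yield a window contribute nothing)
        counts = [max(0, len(X) - window_len - horizon + 1) for X in Xs]
        # cumulative offsets so a flat index can be mapped back to a sequence
        self.offsets = np.concatenate([[0], np.cumsum(counts)])

    def __len__(self):
        return int(self.offsets[-1])

    def __getitem__(self, idx):
        # find which sequence this flat index falls into, then the window start
        seq = np.searchsorted(self.offsets, idx, side='right') - 1
        start = idx - self.offsets[seq]
        X = self.Xs[seq][start:start + self.window_len]
        y = self.ys[seq][start + self.window_len + self.horizon - 1]
        return X, y

As written this only handles integer indices, so dataset[5:10] would hand a slice object to __getitem__ and fail, which is part of why I’m asking about slicing. I assume the options are to check for isinstance(idx, slice) inside __getitem__ or to go through something like torch.utils.data.Subset, but I’d like to know whether this is a sensible way to keep the memory usage down in the first place.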