Datasets and DataLoaders for Sequential Data

I am working with sequential data and am trying to create a custom torch.utils.data.Dataset that reshapes it as needed. The raw data is a collection of sequences (not all the same length) with readings taken at equally spaced time steps. Given window_len contiguous readings, I want to make a point forecast horizon steps into the future. I’ve written the following custom Dataset to reshape my data for training:

import numpy as np
import torch.utils.data

class TimeSeriesDataset(torch.utils.data.Dataset):
    def __init__(self, Xs, ys, window_len, horizon):
        super().__init__()
        self.window_len = window_len
        self.horizon = horizon
        # pre-compute every (window, target) pair up front and stack them
        # across sequences; this is where the memory blow-up happens
        self.X = np.vstack([
            self.get_X(X, window_len, horizon)
            for X in Xs
        ])
        self.y = np.vstack([
            self.get_y(y, window_len, horizon)
            for y in ys
        ])

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

    @staticmethod
    def get_X(values, window_len, offset):
        # stack shifted copies of the sequence so that row i holds the
        # window_len readings starting at i, then drop trailing rows whose
        # window or target would run past the end of the sequence
        n_windows = values.shape[0] - window_len - offset + 1
        windows = np.hstack([
            np.roll(values, -ii, axis=0)
            for ii in range(window_len)
        ])
        return windows[:n_windows].reshape(-1, window_len, values.shape[1])

    @staticmethod
    def get_y(values, window_len, horizon):
        # the target for window i is the reading horizon steps after the
        # last reading in that window
        return values[window_len+horizon-1:].copy()
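
As a quick sanity check (not part of the class, just how I convinced myself the indices line up), a toy sequence shows that row i of get_X is the window starting at reading i, and row i of get_y is the reading horizon steps after that window ends:

values = np.arange(12).reshape(6, 2)  # 6 readings, 2 features
X = TimeSeriesDataset.get_X(values, window_len=3, offset=2)
y = TimeSeriesDataset.get_y(values, window_len=3, horizon=2)
print(X.shape, y.shape)  # (2, 3, 2) (2, 2)
print(X[0])              # readings 0, 1, 2
print(y[0])              # reading 4, i.e. horizon=2 steps after reading 2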

Example usage:

window_len = 3
horizon = 2
n_sequences = 10
random = np.random.RandomState(0)

sequences = [
    random.randint(0, 100, size=(random.randint(5, 10), 2))
    for ii in range(n_sequences)
]
targets = [
    random.randint(0, 100, size=(len(sequence), 1))
    for sequence in sequences
]

dataset = TimeSeriesDataset(sequences, targets, window_len, horizon)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True)
for X_train, y_train in dataloader:
    print(X_train.shape, y_train.shape)
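
If I’m reading the shapes correctly, each full batch prints torch.Size([10, 3, 2]) torch.Size([10, 1]) (batch, window_len, features and batch, 1), with a smaller final batch when the total number of windows isn’t a multiple of the batch size.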

My problem is that while the original data is small enough to fit into memory, the reshaped training data is too large. So I have two questions:

  1. Is this the best way to go about it in cases where the reshaped data does fit into memory?
  2. What is the best way to modify this so that all of the sub-sequences don’t have to be created at once? (A rough sketch of the kind of lazy indexing I have in mind follows below.)
  • I’d still like to be able to use slicing on the dataset if possible (e.g., dataset[5:10]).
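
For concreteness, here is a rough, untested sketch of what I mean by not creating the sub-sequences up front (I’m not sure it’s the right direction): it keeps the raw sequences and maps a flat index to a (sequence, start) pair inside __getitem__, slicing out the window on demand.

class LazyTimeSeriesDataset(torch.utils.data.Dataset):
    def __init__(self, Xs, ys, window_len, horizon):
        super().__init__()
        self.Xs = Xs
        self.ys = ys
        self.window_len = window_len
        self.horizon = horizon
        # number of valid windows contributed by each sequence
        # (sequences too short to yield a window contribute nothing)
        counts = [max(0, len(X) - window_len - horizon + 1) for X in Xs]
        # cumulative offsets so a flat index can be mapped back to a sequence
        self.offsets = np.concatenate([[0], np.cumsum(counts)])

    def __len__(self):
        return int(self.offsets[-1])

    def __getitem__(self, idx):
        # find which sequence this flat index falls into, then the window start
        seq = np.searchsorted(self.offsets, idx, side='right') - 1
        start = idx - self.offsets[seq]
        X = self.Xs[seq][start:start + self.window_len]
        y = self.ys[seq][start + self.window_len + self.horizon - 1]
        return X, y

As written this only handles integer indices, so dataset[5:10] would hand a slice object to __getitem__ and fail, which is part of why I’m asking about slicing. I assume the options are to check for isinstance(idx, slice) inside __getitem__ or to go through something like torch.utils.data.Subset, but I’d like to know whether this is a sensible way to keep the memory usage down in the first place.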