I am working with sequential data, and I am trying to create a custom torch.utils.data.Dataset that will reshape my data as necessary. The raw data is a collection of sequences (not all the same length) with readings taken at equally spaced time steps. Given window_len contiguous readings, I want to make a point forecast horizon steps into the future. I’ve written the following custom Dataset to reshape my data for training:
import numpy as np
import torch.utils.data


class TimeSeriesDataset(torch.utils.data.Dataset):
    def __init__(self, Xs, ys, window_len, horizon):
        super().__init__()
        self.window_len = window_len
        self.horizon = horizon
        # Build every window and target up front, one block per sequence.
        self.X = np.vstack([
            self.get_X(X, window_len, horizon) for X in Xs
        ])
        self.y = np.vstack([
            self.get_y(y, window_len, horizon) for y in ys
        ])

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

    @staticmethod
    def get_X(values, window_len, offset):
        # Shifted copies stacked side by side; rows that wrap around are cut off.
        n_windows = values.shape[0] - window_len - offset + 1
        return np.hstack([
            np.roll(values, -ii, axis=0) for ii in range(0, window_len)
        ])[:n_windows].reshape(-1, window_len, values.shape[1])

    @staticmethod
    def get_y(values, window_len, horizon):
        # The target for window i is the reading at i + window_len + horizon - 1.
        return values[window_len + horizon - 1:].copy()
window_len = 3
horizon = 2
n_sequences = 10

random = np.random.RandomState(0)
sequences = [
    random.randint(0, 100, size=(random.randint(5, 10), 2))
    for ii in range(n_sequences)
]
targets = [
    random.randint(0, 100, size=(len(sequence), 1))
    for sequence in sequences
]

dataset = TimeSeriesDataset(sequences, targets, window_len, horizon)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True)

for X_train, y_train in dataloader:
    print(X_train.shape, y_train.shape)
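To make the intended alignment between windows and targets concrete, here is a minimal sketch on a single toy sequence that uses explicit indexing instead of np.roll. The helper name make_pairs and the toy values are hypothetical and only for illustration; the sketch assumes the sequence has at least window_len + horizon - 1 readings.

```python
import numpy as np


def make_pairs(values, targets, window_len, horizon):
    """Hypothetical helper: pair each window with its target by explicit indexing."""
    # Number of complete (window, target) pairs this sequence can supply.
    n_windows = len(values) - window_len - horizon + 1
    # Window i covers rows i .. i + window_len - 1.
    X = np.stack([values[i:i + window_len] for i in range(n_windows)])
    # Its target sits horizon steps after the last row of the window.
    y = np.stack([targets[i + window_len + horizon - 1] for i in range(n_windows)])
    return X, y


# Toy sequence: 7 time steps, 2 features; targets are 0, 10, ..., 60.
values = np.arange(14).reshape(7, 2)
targets = np.arange(7).reshape(7, 1) * 10

X, y = make_pairs(values, targets, window_len=3, horizon=2)
print(X.shape, y.shape)  # (3, 3, 2) (3, 1)
print(y.ravel())         # [40 50 60]
```

With 7 readings, window_len = 3 and horizon = 2, there are 7 - 3 - 2 + 1 = 3 pairs, and the first window (rows 0 to 2) is matched with the target at row 4.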
My problem is that the original data is small enough to fit into memory, but the reshaped training data is too large. And so I have two questions:
- Is this the best way to go about solving the problem if the data fits into memory?
- What is the best way to modify this so that all of the sub-sequences don’t have to be created at once?
- I’d like to still be able to use slicing on the dataset if possible (e.g.,