Is there a data.Dataset way to use a sliding window over time series data?

Up until now I’ve always dealt with splitting time series data into inputs of a specific length by running a sliding window of a particular size over each datapoint and saving each of the windows to a separate directory to train a model on.

This is very time- and memory-consuming, and it makes it quite long-winded to try differently sized inputs to my models.

I’m wondering if there’s a way I could do this just by using transforms with a data.Dataset object. I’ve managed to do it by taking random samples of a particular size from each datapoint, but obviously this means I end up with significantly less data than with my usual approach, since I’m only taking a small number of windowed samples from each datapoint rather than the many windows the sliding-window method produces.

For a number of cases the dataset is too large to fit into RAM, so it’s not possible to load it all, slide a window over the whole dataset, and get the output in __getitem__.

Cheers.


Save everything as NumPy memory maps and then load only the specific parts of the arrays you need.
You can also generate a file with the information about each sample and the specific offsets to load.
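For example, a minimal sketch of that bookkeeping (the file names, toy arrays and labels here are made up purely for illustration): each series is saved as a .npy file, which can later be opened lazily with mmap_mode, alongside an index describing every sample.

import numpy as np

# toy per-series arrays standing in for your real data
series_list = [np.random.randn(1000, 8), np.random.randn(750, 8)]

index = []  # one entry per series: (file_path, length, label)
for i, arr in enumerate(series_list):
    path = f"series_{i}.npy"
    np.save(path, arr)                     # .npy files can be re-opened later with mmap_mode="r"
    index.append((path, arr.shape[0], 0))  # dummy label, purely for illustration

# store the index so a Dataset can rebuild its sample list without touching the data
np.save("index.npy", np.array(index, dtype=object), allow_pickle=True)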

Thanks, I’ll look into this. I can see how this might work by constructing a single np.memmap file for the whole dataset and then randomly sampling from that very large single array, but can you say a bit more about how this would work when saving each time series as its own memmap?

Cheers.

Hi,
It varies a little depending on a few things.
The main idea is that mmap lets you create an instance of a NumPy array without reading its contents.
The contents are only read once you slice the array, which makes it very efficient to read data from large arrays/tensors.
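For instance, a tiny sketch (assuming a series_0.npy file saved as in the example above): np.load with mmap_mode="r" only reads the file header, and data is pulled from disk when you actually access a slice.

import numpy as np
import torch

arr = np.load("series_0.npy", mmap_mode="r")   # only the header is read here
window = np.array(arr[100:200])                # copies just these 100 rows into memory
tensor = torch.from_numpy(window)              # hand the window to PyTorch as usual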

There are some variations depending on two things: how many samples you have and how large they are.

The most efficient option is to create one large array by stacking all your samples (you may need to pad if they are of different lengths).
Then you instantiate this mmap in __init__ and slice it (i.e. read it) in __getitem__.
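A rough sketch of that stacked-array variant (the file names stacked.npy / labels.npy, the padding to a common length, and the class name are assumptions for illustration, not something from this thread):

import numpy as np
import torch
from torch.utils import data


class StackedMemmapDataset(data.Dataset):
    """Sliding windows over a single (n_samples, seq_len, n_features) memmap."""

    def __init__(self, array_path, label_path, window_size, stride):
        # mmap_mode="r" only reads the header; the data itself stays on disk
        self.array = np.load(array_path, mmap_mode="r")
        self.labels = np.load(label_path)
        self.window_size = window_size
        self.stride = stride

        seq_len = self.array.shape[1]
        starts = range(0, seq_len - window_size + 1, stride)
        # precompute a (sample_index, window_start) pair for every window
        self.index = [(i, s) for i in range(self.array.shape[0]) for s in starts]

    def __len__(self):
        return len(self.index)

    def __getitem__(self, idx):
        i, s = self.index[idx]
        # only this slice is actually read from disk
        window = np.array(self.array[i, s: s + self.window_size])
        return {"sample": torch.from_numpy(window), "label": int(self.labels[i])}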

If you have hundreds of files you can save them independently, instantiate the mmaps in __init__ and read them in __getitem__.
If you have millions of files, you can still save them independently, but you would probably need to both instantiate the mmap and read it inside __getitem__.

This is really up to you.

All you need to do is make an algorithm which predefines which samples and segments you are going to read, ensuring you use your data as much as possible.
From this you can infer the length of the dataset and write __getitem__.
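Concretely, for a series of length L, window size w and stride s, the number of complete windows is floor((L - w) / s) + 1, and the dataset length is the sum of that over all samples. A small sketch of that arithmetic (the example lengths and parameters are made up):

def num_windows(length, window_size, stride):
    """Number of complete windows that fit in one series."""
    if length < window_size:
        return 0
    return (length - window_size) // stride + 1

# e.g. lengths of each series, however they were recorded
lengths = [1000, 750, 1200]
dataset_len = sum(num_windows(n, window_size=128, stride=32) for n in lengths)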

Thanks a lot for your suggestion. I actually managed to solve the issue by creating a list of tuples of (data_file_path, desired_index, class_label), then returning (torch.load(data_file)[desired_index: desired_index + window_size], label) from __getitem__, and using this to generate batches.

For anyone interested, this is the class:

import os
from random import shuffle

import torch
from torch.utils import data


class TimeSeriesDataSet(data.Dataset):

    def __init__(self, tensor_dir, transform, window_size, stride):

        self.tensor_directory = tensor_dir
        self.transform = transform
        self.files = os.listdir(tensor_dir)
        self.window_size = window_size
        self.stride = stride
        self.data_tuples = []

        for f in self.files:
            file = os.path.join(tensor_dir, f)
            series, label = torch.load(file)

            # pad with zeros when the tensor length is not a multiple of the window size
            if series.size(0) % self.window_size != 0:
                zeros = torch.zeros(self.window_size - (series.size(0) % self.window_size),
                                    series.size(1)).double()
                series = torch.cat((series, zeros), dim=0)

            # start index of every window that fits in this series (+ 1 keeps the final window)
            idxs = list(range(0, series.size(0) - self.window_size + 1, self.stride))

            if len(idxs) == 0:
                continue

            for j in idxs:
                self.data_tuples.append((file, j, label))

        shuffle(self.data_tuples)

    def __len__(self):
        return len(self.data_tuples)

    def __getitem__(self, idx):

        if torch.is_tensor(idx):
            idx = idx.tolist()

        data_file, start, label = self.data_tuples[idx]
        sample, _ = torch.load(data_file)
        sample = sample[start: start + self.window_size]

        if self.transform:
            sample = self.transform(sample)

        return {'sample': sample, 'label': label}

Well, note that this is more or less what I was suggesting.
If you use that with mmap you can load just the desired slice and it will work even faster 🙂
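For example, a minimal sketch of that change (assuming the per-series data were exported as .npy files and that numpy is imported as np; the posted class uses torch.save/torch.load instead), __getitem__ could look roughly like:

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        data_file, start, label = self.data_tuples[idx]
        # open lazily and copy only the window we need, instead of loading the whole series
        series = np.load(data_file, mmap_mode="r")
        sample = torch.from_numpy(np.array(series[start: start + self.window_size]))

        if self.transform:
            sample = self.transform(sample)

        return {'sample': sample, 'label': label}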


Yes, I didn’t mean to suggest that it was completely different from your suggestion. It was definitely very helpful. At the moment speed doesn’t seem to be an issue, but I’ll definitely look into using memmaps in the future.