Dataloader for imbalanced, discontinuous time series data

Hi,

My task is time series classification based on the previous n periods of data.
I have multiple separate DataFrames containing time series data.
I concatenated them into a single one; before that I set the first n labels of each to None, in order to avoid data contamination from the previous DataFrame (windows that would mix data from two frames).
With a collate function I filter out these None labels and the mixed data.
This part works fine.
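
For context, the concatenation step looks roughly like this (a sketch; it assumes the label sits in the last column and that n equals the window length):

import numpy as np
import pandas as pd

def concat_frames(dfs, window):
    parts = []
    for df in dfs:
        df = df.copy()
        label_col = df.columns[-1]
        df[label_col] = df[label_col].astype(float)  # NaN needs a float column
        df.iloc[:window, -1] = np.nan                # invalidate the first `window` labels of each frame
        parts.append(df)
    return pd.concat(parts, ignore_index=True)

# concat_frames([df1, df2, df3], window=240).to_pickle("train_samples.pkl")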

import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset


class MyDatasetDf(Dataset):
    def __init__(self, data, window):
        self.data = data          # 2D array: feature columns + label in the last column
        self.window = window      # lookback length of each sample

    def __getitem__(self, index):
        x = self.data[index:index + self.window]
        if np.isnan(x[-1][-1]):   # label of the last row is NaN -> boundary sample, drop it
            return None
        else:
            label = x[-1][-1]
            features = x[:, :-1]
            sample = {"input": features, "label": label}
            return sample

    def __len__(self):
        # only count rows that carry a valid label
        len_valid_labels = np.count_nonzero(~np.isnan(self.data[:, -1]))
        return len_valid_labels


def collate_fn(batch):
    # drop the None samples returned for boundary/NaN-label windows
    batch = list(filter(lambda x: x is not None, batch))
    return torch.utils.data.dataloader.default_collate(batch)


def create_trainloader_df(batch_size, train_samples, window=240):
    train_concat_df = pd.read_pickle(train_samples)
    train_np_arr = train_concat_df.to_numpy()
    train_dataset = MyDatasetDf(train_np_arr, window)
    train_loader = DataLoader(train_dataset, collate_fn=collate_fn, batch_size=batch_size, shuffle=True)

    return train_loader

Now, I’d like to use WeightedRandomSampler to address the class imbalance, but these None labels cause problems.

If I create a list of dictionaries with each label and its input data instead of reading from the DataFrame, the WeightedRandomSampler works well, but it’s highly inefficient, and I couldn’t keep all that data in memory anyway.
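
The list-of-dictionaries variant that does work with the sampler looks roughly like this (a sketch, just to illustrate the memory problem):

samples = []
for i in range(len(train_np_arr) - window + 1):
    x = train_np_arr[i:i + window]
    if not np.isnan(x[-1, -1]):                      # keep only windows with a real label
        samples.append({"input": x[:, :-1], "label": int(x[-1, -1])})
# every entry now has a valid label, so per-sample weights map 1:1 to samples,
# but the fully expanded windows no longer fit in memory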

Is there a better way to read time series data from multiple sources?

If you only have the first n labels as None, why don’t you map the index from [0, len(ds)-n) to [n, len(ds)) in your __getitem__ function?

Hi @ejguan, thank you for your answer.
It is the first n labels of each DataFrame, so when I concatenate them, there are going to be some None labels in the middle too.

If you want to eliminate None from your Dataset, I would suggest having a way to map from [0, len(ds)) to the labels with actual data (see the sketch after the example below).
Let’s say you have two dataframes to be concatenated.

# df1 label
None
None
None
3
4
5

# df2 label
None
None
2
3
4

# concat
# Index label
-1 None
-1 None
-1 None
0  3
1  4
2  5
-1 None
-1 None
3  2
4  3
5  4
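
A rough sketch of such a mapping (untested; names are just illustrative):

import numpy as np
from torch.utils.data import Dataset

class MappedDatasetDf(Dataset):
    def __init__(self, data, window):
        self.data = data
        self.window = window
        # dataset index i maps to the i-th row that has a real label
        self.valid_rows = np.flatnonzero(~np.isnan(data[:, -1]))

    def __len__(self):
        return len(self.valid_rows)

    def __getitem__(self, index):
        row = self.valid_rows[index]                   # e.g. 0 -> 3 and 3 -> 8 in the table above
        x = self.data[row - self.window + 1:row + 1]   # window ending at the labeled row
        return {"input": x[:, :-1], "label": x[-1, -1]}

With that, every index has a real label, so the per-sample weights for WeightedRandomSampler line up directly with the dataset indices.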

Thank you for your input!

I went a different way:
I created a new label class for the samples whose windows would mix data from different DataFrames.
After calculating the weights of the different classes, I set the weight of this last class to 0.
This way those problematic samples never get selected by the sampler.

It works for me.

import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset, WeightedRandomSampler


class MyDatasetDf(Dataset):
    def __init__(self, data, window):
        self.data = data
        self.window = window

    def __getitem__(self, index):
        x = self.data[index:index + self.window]
        features = x[:, :-1]
        label = x[-1][-1]
        return features, label

    def __len__(self):
        return len(self.data) - self.window + 1


def create_trainloader_df(batch_size, df, window=4):
    df_np_arr = df.to_numpy()
    train_dataset = MyDatasetDf(df_np_arr, window=window)

    # labels of the rows that can end a window; copy so the dataset data stays untouched
    labels = df_np_arr[window - 1:, -1].copy()
    labels[np.isnan(labels)] = np.nanmax(labels) + 1  # create a new class for NaN labels
    labels = labels.astype(int)

    # inverse-frequency class weights (assumes class labels are consecutive integers starting at 0)
    class_sample_count = np.array([len(np.where(labels == t)[0]) for t in np.unique(labels)])
    weight = 1 / class_sample_count
    weight[-1] = 0                                    # set the weight of the new (NaN) class to 0
    samples_weight = np.array([weight[t] for t in labels])
    samples_weight = torch.from_numpy(samples_weight)
    sampler = WeightedRandomSampler(samples_weight, len(samples_weight))

    train_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)

    return train_loader
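
For completeness, usage then looks roughly like this (the file name is illustrative):

df = pd.read_pickle("train_samples.pkl")   # the concatenated DataFrame from before
train_loader = create_trainloader_df(batch_size=64, df=df, window=240)
for features, labels in train_loader:
    pass  # the NaN/boundary class is never drawn, and batches are roughly class-balanced in expectation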