Hi,
My task is time series classification based on the previous n periods of data.
I have multiple separate dataframes containing time series data.
I concatenated them into a single one; before doing so, I set the first n labels of each frame to None to avoid data leakage from the previous dataframe.
With a collate function I filter out these None labels, i.e. the windows that would mix data from two sources.
This part works fine.
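For context, the concatenation step looks roughly like this (a minimal sketch; it assumes the label is the last column and that None becomes NaN once the frame is numeric):

```python
import numpy as np
import pandas as pd

def concat_sources(frames, window):
    """Concatenate per-source dataframes, blanking the first `window` labels
    of each frame so no training window reaches back into the previous source."""
    masked = []
    for df in frames:
        df = df.copy()
        df.iloc[:window, -1] = np.nan  # invalidate windows that would cross sources
        masked.append(df)
    return pd.concat(masked, ignore_index=True)
```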
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class MyDatasetDf(Dataset):
    def __init__(self, data, window):
        self.data = data
        self.window = window

    def __getitem__(self, index):
        x = self.data[index:index + self.window]
        if np.isnan(x[-1][-1]):  # label is None/NaN -> window crosses a source boundary
            return None
        else:
            label = x[-1][-1]     # label is the last column of the last row
            features = x[:, :-1]  # all columns except the label
            sample = {"input": features, "label": label}
            return sample

    def __len__(self):
        # count only the rows that carry a valid (non-NaN) label
        len_valid_labels = np.count_nonzero(~np.isnan(self.data[:, -1]))
        return len_valid_labels

def collate_fn(batch):
    # drop the None samples produced at source boundaries
    batch = list(filter(lambda x: x is not None, batch))
    return torch.utils.data.dataloader.default_collate(batch)

def create_trainloader_df(batch_size, train_samples, window=240):
    train_concat_df = pd.read_pickle(train_samples)
    train_np_arr = train_concat_df.to_numpy()
    train_dataset = MyDatasetDf(train_np_arr, window)
    train_loader = DataLoader(train_dataset, collate_fn=collate_fn,
                              batch_size=batch_size, shuffle=True)
    return train_loader
Now I’d like to use WeightedRandomSampler to address the class imbalance, but these None labels cause problems.
If I instead build a list of dictionaries (one per label with its input data) rather than reading from the dataframe, WeightedRandomSampler works well, but it’s highly inefficient, and I couldn’t keep all that data in memory anyway.
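One idea I’ve been sketching (assumptions: the label is the last column and NaN marks the invalid rows) is to build the weight vector directly from the label column and give zero weight to the NaN positions, so the sampler simply never draws those indices:

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

def make_sampler(labels):
    """Per-index weights: 0 for NaN labels, 1/class_count otherwise,
    so the sampler skips source-boundary rows and balances the classes."""
    labels = np.asarray(labels, dtype=float)
    valid = ~np.isnan(labels)
    weights = np.zeros(len(labels))
    classes, counts = np.unique(labels[valid], return_counts=True)
    for c, n in zip(classes, counts):
        weights[valid & (labels == c)] = 1.0 / n
    return WeightedRandomSampler(torch.as_tensor(weights, dtype=torch.double),
                                 num_samples=int(valid.sum()),
                                 replacement=True)
```

The sampler would then replace shuffling, e.g. `DataLoader(train_dataset, collate_fn=collate_fn, batch_size=batch_size, sampler=make_sampler(train_np_arr[:, -1]))` (sampler and shuffle=True are mutually exclusive). I’m not sure this is the right approach, though.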
Is there a better way to read time series data from multiple sources?