How to enumerate over multiple time-series datasets?

If I had a bunch of different image datasets I would just concatenate them using ConcatDataset from torch.utils.data. For multiple time-series datasets I can't simply concatenate, because each one needs a rolling window, and I can't just index continuously from one dataset into the next.

Is it possible to come up with a solution using a custom dataloader, or should I just iterate through a different pandas dataframe after each training episode?

Here is my custom dataloader that works with a 1D CNN on one time-series dataset at a time.

import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split

class MyDataset(Dataset):
    def __init__(self, data, window):
        self.data = data
        self.window = window
        print(data.tail())  # quick sanity check of the input frame
        self.xData = torch.FloatTensor(data[['data_1', 'data_2', 'data_3']].values.astype('float'))
        self.yData = torch.FloatTensor(data['labels'].values.astype('int'))

    def __len__(self):
        # the last valid start index leaves room for the window plus its target
        return len(self.data) - self.window

    def __getitem__(self, index):
        # the label is the value immediately after the window
        target = self.yData[index + self.window]
        # transpose (window, channels) -> (channels, window) for the 1D CNN;
        # a plain reshape here would scramble values across channels
        data_val = self.xData[index:index + self.window].transpose(0, 1)
        return data_val, target

# split data into train test set
df_train, df_test = train_test_split(df, test_size=0.3, shuffle=False)

# hand each pandas dataframe to our custom dataset class
train_dataset = MyDataset(df_train, window_size)
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)

test_dataset = MyDataset(df_test, window_size)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=True)
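
One note: because each MyDataset applies its own rolling window, a window can never straddle two dataframes, so ConcatDataset should actually work here after all. A minimal sketch, assuming a hypothetical df_list holding dataframes with the same columns:

from torch.utils.data import ConcatDataset, DataLoader

# one windowed dataset per dataframe; indexing stays local to each dataset,
# so no window ever crosses a dataframe boundary
datasets = [MyDataset(df, window_size) for df in df_list]
combined_loader = DataLoader(ConcatDataset(datasets), batch_size=batch_size, shuffle=True)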

A method that currently works, but that I don't like doing:

# dictionary full of multiple pandas dataframes
keys = list(dictionary.data.keys())
count = (count + 1) % len(keys)  # wrap around so the index never runs past the last dataframe
str_txt = keys[count]
data = dictionary.data[str_txt]
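
An alternative sketch for the same round-robin, using itertools.cycle so there is no counter to maintain (same dictionary.data as above):

import itertools

# endless round-robin over (name, dataframe) pairs
frame_cycle = itertools.cycle(dictionary.data.items())

# at the start of each training episode:
str_txt, data = next(frame_cycle)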

Did you ever find a solution to this?

I've had to do something similar.
I have a bunch of individual time-series arrays in a dictionary called item_dic.

import torch

class TimeseriesDataset_from_Dict(torch.utils.data.Dataset):

    def __init__(self, item_dic, n_steps=10, seq_len=100):
        self.item_dic = item_dic                    # dict of equal-length series keyed 0..N-1
        self.dic_items = len(item_dic.keys())
        self.X_shape = item_dic[0].shape[0]         # length of each series
        self.seq_len = seq_len
        self.X_len = self.X_shape - (self.seq_len - 1)  # valid window starts per series
        self.n_steps = n_steps                      # stride between consecutive window starts

    def __len__(self):
        return int(self.X_len * self.dic_items / self.n_steps)

    def __getitem__(self, index):
        # map the flat index to (which series, window start within that series)
        dic_ind = (index * self.n_steps) // self.X_len
        X = self.item_dic[dic_ind]
        index = (index * self.n_steps) % self.X_len
        # print(dic_ind, index)  # debug
        return X[index:index + self.seq_len]
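
In case it helps, a usage sketch under the same assumptions the class makes (integer keys 0..N-1, equal-length arrays); the toy data and sizes here are made up:

import numpy as np
from torch.utils.data import DataLoader

# three toy series of equal length, keyed 0..2 as the class expects
item_dic = {i: np.random.randn(500).astype('float32') for i in range(3)}

dataset = TimeseriesDataset_from_Dict(item_dic, n_steps=10, seq_len=100)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

for batch in loader:
    print(batch.shape)  # torch.Size([16, 100]); numpy windows are collated to tensors
    break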

Interesting solution, Mark_Pinches!

I am modifying it to include a scaler and Y labels.

I am a bit confused, though: what is n_steps here?
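
From the code above, n_steps appears to be the stride between consecutive window starts, i.e. how far the rolling window advances per sample: n_steps=1 would yield every overlapping window, while n_steps=seq_len would yield non-overlapping ones. A small check with made-up numbers:

import numpy as np

# hypothetical illustration: with seq_len=100 and n_steps=10, consecutive
# indices map to window starts 10 steps apart within the same series
item_dic = {0: np.arange(500)}
ds = TimeseriesDataset_from_Dict(item_dic, n_steps=10, seq_len=100)
assert (ds[0] == np.arange(0, 100)).all()
assert (ds[1] == np.arange(10, 110)).all()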