Prepare multivariate time series data for Seq2Seq (many-to-many of same length)

krypton · April 23, 2024, 12:36pm

I’m still struggling to get good performance with Seq2Seq models applied to time series, and I keep falling back on non-deep-learning models. Either (1) I’m not preparing my data properly before feeding it to the Seq2Seq model, or (2) I made a logic error when implementing Seq2Seq.

In this discussion, I will present how I typically prepare multivariate time series data to feed Seq2Seq models.

For example, let’s consider the following dataset:

# shape = (5, 3)
X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9],
     [10, 11, 12],
     [13, 14, 15]]

# shape = (5,)
y = [3.5, 6.5, 9.5, 12.5, 15.5]

Setting seq_len = 2 and stride = 2 we obtain the following sequences:

X_first_seq = [[1, 2, 3],
               [4, 5, 6]]
y_first_seq = [[3.5], [6.5]]

X_secon_seq = [[7, 8, 9],
               [10, 11, 12]]
y_secon_seq = [[9.5], [12.5]]

X_third_seq = [[13, 14, 15]]
y_third_seq = [[15.5]]
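
For reference, the slicing above can be reproduced with a minimal sketch like the following. It assumes NumPy, and `make_sequences` is a hypothetical helper name used only for illustration. It keeps the shorter last window, as in the example; with stride < seq_len, several trailing windows could come out short.

```python
import numpy as np

def make_sequences(X, y, seq_len, stride):
    """Slice X and y into aligned windows; the last window may be shorter."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1, 1)  # (n,) -> (n, 1)
    X_seqs, y_seqs = [], []
    for start in range(0, len(X), stride):
        X_seqs.append(X[start:start + seq_len])
        y_seqs.append(y[start:start + seq_len])
    return X_seqs, y_seqs

X = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]]
y = [3.5, 6.5, 9.5, 12.5, 15.5]

X_seqs, y_seqs = make_sequences(X, y, seq_len=2, stride=2)
print([s.shape for s in X_seqs])  # [(2, 3), (2, 3), (1, 3)]
```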

If we set batch_size = 2 we get:

batch_1 = [X_first_seq, X_secon_seq]   # shape = (2, 2, 3) = (batch_size, seq_len, input_size)
batch_2 = [X_third_seq]   # shape = (1, 1, 3): the last sequence has only 1 time step; it is (1, 2, 3) only if padded to seq_len
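
Since the last sequence has a single time step, it cannot be stacked with full-length ones as is. One possible option, sketched below, is to zero-pad short sequences to seq_len and keep a boolean mask so padded steps can be excluded from the loss. This is an assumption rather than part of the setup described above, and `pad_and_batch` is a hypothetical helper name.

```python
import numpy as np

def pad_and_batch(seqs, seq_len, batch_size):
    """Zero-pad short sequences to seq_len, then group them into batches."""
    padded, masks = [], []
    for s in seqs:
        s = np.asarray(s, dtype=float)
        n_pad = seq_len - len(s)
        # Pad only along the time axis, with zeros (np.pad's default mode).
        padded.append(np.pad(s, ((0, n_pad), (0, 0))))
        # True marks real time steps, False marks padding.
        masks.append(np.arange(seq_len) < len(s))
    padded, masks = np.stack(padded), np.stack(masks)
    return [(padded[i:i + batch_size], masks[i:i + batch_size])
            for i in range(0, len(padded), batch_size)]

seqs = [[[1, 2, 3], [4, 5, 6]],
        [[7, 8, 9], [10, 11, 12]],
        [[13, 14, 15]]]

batches = pad_and_batch(seqs, seq_len=2, batch_size=2)
for X_batch, mask in batches:
    print(X_batch.shape, mask.tolist())
```

Dropping the incomplete last sequence is the other common choice; which one is preferable depends on how much data the short tail represents.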

As you can see, I always prepare my multivariate data for Seq2Seq modeling (many-to-many, with input and output sequences of the same length).
Question 1: Is this a good way to prepare multivariate data? Is there a more effective way to do it?

In my real-world data (hourly frequency), X_train.shape = (12000, 11), y_train.shape = (12000,) and X_test.shape = (5000, 11).
So we want to forecast 5000 hours ahead (approximately 7 months).
I set:

seq_len = 168 # 1 week
stride  = 168 # Each sequence is a slice of 1 week
batch   = 12 # Each batch contains 12 weeks (about 3 months)
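
As a quick sanity check on what these settings imply for the shapes stated above, here is a small arithmetic sketch. The variable names are illustrative, and the sequence count assumes stride equals seq_len and that the incomplete last sequence is kept.

```python
import math

n_train, n_test = 12000, 5000
seq_len, stride, batch_size = 168, 168, 12

n_full_train = n_train // seq_len             # complete 1-week sequences
remainder_train = n_train % seq_len           # hours left in the last, partial sequence
n_seqs_train = math.ceil(n_train / stride)    # sequences including the partial one
n_batches_train = math.ceil(n_seqs_train / batch_size)

n_seqs_test = math.ceil(n_test / stride)
remainder_test = n_test % seq_len
horizon_days = n_test / 24

print(n_full_train, remainder_train, n_seqs_train, n_batches_train)  # 71 72 72 6
print(n_seqs_test, remainder_test, round(horizon_days, 1))           # 30 128 208.3
```

So the training set does not divide evenly into weeks: the last training sequence covers only 72 hours, and the last test sequence only 128 hours.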

Question 2: How do you set seq_len, stride and batch for long-horizon forecasting?

Question 3: Should I segment the test set batches the same way as the train set batches?