I’m trying to implement an encoder-decoder LSTM model for a univariate time-series forecasting problem with multivariate covariates. In other words, I have a predictor time-series variable y and associated time-series features that should help predict future values of y. The structure of the encoder-decoder network, as I understand and have implemented it, is shown in the figure (apologies for the formatting of the key, I couldn’t get the last entry to format on one line correctly!).

Below is a description of a toy example where I want to predict y two steps into the future using the past three timepoints. The general idea is that the encoder LSTM encodes a context variable, which is then used to generate the prediction series sequentially. However, I’m getting tensor dimension mismatches that I don’t understand (all dims are included in the diagram and I am using a batch-first approach). For simplicity, the example assumes a single LSTM layer:

**Encoder:** I pass an input tensor containing the predictor variable and covariates at times t-2, t-1, t, which has dimensions (N, L, H_in), where L=3 and H_in is the number of covariate features + 1 (in the diagram I have unrolled the LSTM over the time inputs, which is why L=1 there). The output of the encoder is the final hidden state (h_t, c_t), each with dimensions (1, N, H_out), where H_out is the encoder LSTM hidden size.
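For concreteness, a stripped-down version of my encoder (all sizes here are arbitrary placeholders, not my real hyperparameters):

```python
import torch
import torch.nn as nn

N, L, H_in, H_out = 8, 3, 5, 16  # batch, lookback window, covariates + 1, hidden size

encoder = nn.LSTM(input_size=H_in, hidden_size=H_out, num_layers=1, batch_first=True)

x = torch.randn(N, L, H_in)        # predictor + covariates at t-2, t-1, t
enc_out, (h_t, c_t) = encoder(x)   # h_t, c_t each have shape (1, N, H_out)
```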

**Decoder:** The final hidden state of the encoder is passed as the initial hidden state of the decoder, along with the current value of y at time t, to predict y_t+1. The predicted value of y_t+1 and the hidden state at t+1 are then used to predict y_t+2.
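The decoding loop I’m aiming for looks roughly like this. Note that I’ve set the decoder hidden size to the encoder’s H_out purely so the state hand-off type-checks, and added a hypothetical linear layer `proj` to map the hidden output back to a scalar; whether these choices are right is part of what I’m unsure about:

```python
import torch
import torch.nn as nn

N, H_out = 8, 16                   # placeholder sizes
h_t = torch.zeros(1, N, H_out)     # stand-ins for the encoder's final state
c_t = torch.zeros(1, N, H_out)

decoder = nn.LSTM(input_size=1, hidden_size=H_out, batch_first=True)
proj = nn.Linear(H_out, 1)         # maps the decoder hidden output to a scalar y prediction

y_t = torch.randn(N, 1, 1)         # last known value of y at time t
hidden = (h_t, c_t)                # initialise the decoder with the encoder state
preds = []
for _ in range(2):                 # generate y_{t+1}, y_{t+2} sequentially
    dec_out, hidden = decoder(y_t, hidden)  # dec_out: (N, 1, H_out)
    y_t = proj(dec_out)                     # (N, 1, 1), fed back in at the next step
    preds.append(y_t)

y_hat = torch.cat(preds, dim=1)    # (N, 2, 1): the two-step forecast
```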

In theory I think this should work; however, I am getting a dimension mismatch between the hidden state output by the encoder and the input of the decoder. The encoder hidden state has shape (1, N, H_out), where H_out is the hidden size of the encoder LSTM. The decoder LSTM input is (N, 1, 1), since it is just the last known value of the predictor variable. My understanding is that the last dimension of the hidden state should match the last dimension of the decoder input, but this can’t hold unless the LSTM hidden size is 1 (and PyTorch is giving a dimension mismatch error). This will also be an issue when using the decoder to generate the output prediction, since the decoder hidden size won’t be 1 either.
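Here is a minimal repro of the error I’m hitting, with the decoder hidden size set to 1 to match its scalar input (which is the constraint I thought was required):

```python
import torch
import torch.nn as nn

N, L, H_in, H_out = 8, 3, 5, 16  # placeholder sizes

encoder = nn.LSTM(input_size=H_in, hidden_size=H_out, batch_first=True)
_, (h_t, c_t) = encoder(torch.randn(N, L, H_in))   # h_t: (1, N, H_out)

# Decoder with hidden_size=1, matching its scalar input as I thought was required
decoder = nn.LSTM(input_size=1, hidden_size=1, batch_first=True)
try:
    decoder(torch.randn(N, 1, 1), (h_t, c_t))      # encoder state has H_out=16, not 1
except RuntimeError as e:
    print(e)   # PyTorch reports the expected vs. actual hidden state sizes
```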

Do I have a fundamental misunderstanding of how encoder-decoder networks are used for time-series forecasting, or is there a step that I am missing? I have read that encoder-decoder networks can be used for time-series forecasting, but I can’t understand how they get around this issue! A lot of the examples are for NLP and use embedding layers, which I think sidesteps the issue.