How does the decoder work in Transformers?

Hello everyone,

I am trying to create a Transformer model for time series forecasting. Let’s assume I am trying to predict the temperature at time t+1 using multiple covariates (including the temperature) from time t-6 to t.
I have fed into my encoder all the covariates from time t-6 to t, and fed my decoder the temperature at time t.

I now have two questions:

  1. If I were to make the model autoregressive and predict at times t+1, t+2, and t+3, should I still only feed the temperature at time t into my decoder? I am not sure whether I should provide more in order to implement teacher forcing techniques.

  2. Let’s assume I have another model that generates predictions for one of my covariates (say, the pressure). That means the predicted pressure at time t+1 is also available, which should help my model predict the temperature more accurately. Where should this new variable be added? In the encoder? In the decoder? In both?

Thanks a lot for any input!

Hi,
is there a reason you want to use an encoder-decoder setup? If I understand your setting correctly, there are no natural source and target sequences that would usually go into the encoder and decoder respectively. For example, if you train an encoder-decoder transformer to translate from French to English, it makes sense that your source sequence (the French sentence you want to translate) goes into the encoder, and your target sequence starts with a <start> token and you go from there. In your setup, however, I don’t see an obvious choice for the split between source and target sequence (I think this is what you are wondering about in question 1 as well). I would suggest just using a decoder-only architecture that predicts t+1 from t-6 to t using masked self-attention; I don’t think you need cross-attention from a decoder here.
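Something like the following is what I have in mind (just a rough sketch: all layer sizes and names are made up, and positional encoding is left out for brevity):

```python
import torch
import torch.nn as nn

class CausalTimeSeriesTransformer(nn.Module):
    """Decoder-only style model: masked self-attention over the past window,
    with the last position predicting the next temperature."""
    def __init__(self, n_features, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.input_proj = nn.Linear(n_features, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)  # scalar temperature forecast

    def forward(self, x):  # x: (batch, seq_len, n_features), e.g. steps t-6 .. t
        seq_len = x.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(x.device)
        h = self.backbone(self.input_proj(x), mask=causal_mask)
        return self.head(h[:, -1])  # prediction for t+1 from the last time step

# 7 past steps (t-6 .. t), 3 covariates per step, batch of 8
model = CausalTimeSeriesTransformer(n_features=3)
y_hat = model(torch.randn(8, 7, 3))  # -> (8, 1)
```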
Concerning your second question, I am not totally sure I understand your situation correctly, but I would just concatenate your inputs to the decoder with the new variable. To be specific, if your input was t and you have additional information about t, say a variable x, just define your new input t' = [t, x] and let the transformer figure out how to use it.
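In tensor terms I just mean something like this (all shapes made up for illustration):

```python
import torch

# observed covariates for the past window: (batch, seq_len, n_covariates)
past = torch.randn(8, 7, 3)  # e.g. temperature, pressure, wind for t-6 .. t

# the extra variable x (e.g. an externally predicted value), repeated along the sequence
x = torch.randn(8, 1, 1).expand(-1, 7, -1)  # (batch, seq_len, 1)

# new input t' = [t, x]: concatenate along the feature dimension
inputs = torch.cat([past, x], dim=-1)  # (8, 7, 4)
```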
Hope that was somewhat helpful, feel free to clarify if I misunderstood something.

Hi,

Thanks for replying. Maybe I should add a bit of context. While doing some research on time series forecasting, I found that Transformers have become a standard architecture. I am currently trying to implement the TemporalFusionTransformer (TFT) from pytorch-forecasting, which is why I asked about an encoder/decoder setup.
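For reference, my understanding of the standard setup I am trying to reproduce is roughly the following (just a sketch based on my reading of the pytorch-forecasting docs, so some argument names may be slightly off, and the columns are made up):

```python
import numpy as np
import pandas as pd
from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer

# toy frame: one weather station with regularly spaced readings
df = pd.DataFrame({
    "time_idx": np.arange(100),
    "station": "station_1",
    "temperature": np.random.randn(100),
    "pressure": np.random.randn(100),
    "wind": np.random.randn(100),
})

training = TimeSeriesDataSet(
    df,
    time_idx="time_idx",
    target="temperature",
    group_ids=["station"],
    max_encoder_length=7,     # past window t-6 .. t goes to the encoder
    max_prediction_length=1,  # horizon t+1 goes to the decoder
    time_varying_unknown_reals=["temperature", "pressure", "wind"],  # only observed up to t
    time_varying_known_reals=["time_idx"],  # known for future steps as well
)

tft = TemporalFusionTransformer.from_dataset(training, hidden_size=16)
```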
For unrelated reasons, I have implemented my own dataloader, which makes using the library a bit less straightforward (but will eventually make me more competent!)
Regarding the second question, I initially thought that as well, but wouldn’t that defeat the purpose of having time-indexed inputs? For a timestep t I would then have an input tensor [Temperature_t, Pressure_t, Wind_t, Pressure_t+1]. That’s why I was thinking it might be more relevant to put it directly into the decoder input.
Since I am fairly new to transformers and most documentation covers NLP tasks, I might not have fully understood every concept, hence my questions :slight_smile: