Advice about Transformers for time series

Hello there! I've been experimenting with attention mechanisms and Transformers for time series forecasting. I'm familiar with the previously published discussions, e.g. this. My idea is to pass a time series to a Transformer, e.g. [1, 2, 3, 4, 5], and get the forecast for [6, 7, 8, 9, 10]. Note that each item (1…n) should be a vector, not a scalar, because I want to use multiple indicators from financial data. A minimal sketch of the shapes I have in mind follows.
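To make the setup concrete (all sizes and names below are placeholders I made up):

```python
import torch

# Each time step is a feature vector (several financial indicators),
# so inputs are (batch, seq_len, n_features) rather than scalars.
batch_size, src_len, tgt_len, n_features = 32, 5, 5, 8  # placeholder sizes

src = torch.randn(batch_size, src_len, n_features)  # known steps [1..5]
tgt = torch.randn(batch_size, tgt_len, n_features)  # steps [6..10] to forecast
```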

  1. I see that in many examples TransformerDecoder is not used at all: the models consist of an encoder and a final linear layer only (roughly like the first sketch after this list). Why is that, and am I right in thinking that adding a decoder should improve performance? In which cases is a decoder necessary, and why do the majority of samples and tutorials skip it?
  2. When implementing a decoder, one has to give it targets and memory. Suppose that for my use case the source to the encoder is [1, 2, 3, 4, 5], the targets for the decoder are [6, 7, 8, 9, 10], and the memory is the output of the encoder. During training this is fine, because the dataset contains both the sources and the targets. But how should one handle predictions on new data, where no targets exist? From my experience, feeding the decoder just zeros or ones (with positional encoding) only confuses it further. Am I wrong to think of the targets as [6, 7, 8, 9, 10]? How should I properly implement this for training and then for predicting on new data? My current understanding is in the second sketch below.
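Regarding question 1, this is roughly the encoder-only pattern I keep seeing in tutorials. It's only a sketch: the hyperparameters and the flatten-then-project head are my own assumptions, and I've left out positional encoding for brevity.

```python
import torch
import torch.nn as nn

class EncoderOnlyForecaster(nn.Module):
    """Encoder + final linear layer, as in most tutorials (sketch, sizes assumed)."""

    def __init__(self, n_features=8, d_model=64, nhead=4, num_layers=2,
                 src_len=5, tgt_len=5):
        super().__init__()
        self.input_proj = nn.Linear(n_features, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Flatten the encoded source and project straight to the whole
        # forecast horizon, so no decoder is involved at all.
        self.head = nn.Linear(src_len * d_model, tgt_len * n_features)
        self.tgt_len, self.n_features = tgt_len, n_features

    def forward(self, src):                     # src: (batch, src_len, n_features)
        h = self.encoder(self.input_proj(src))  # (batch, src_len, d_model)
        out = self.head(h.flatten(1))           # (batch, tgt_len * n_features)
        return out.view(-1, self.tgt_len, self.n_features)


model = EncoderOnlyForecaster()
forecast = model(torch.randn(32, 5, 8))  # -> (32, 5, 8)
```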
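And for question 2, here is my current understanding of how training (teacher forcing) and inference (autoregressive decoding) would differ. This is exactly the part I'm unsure about, so everything below (shifting the targets right, seeding the decoder with the last source step) is an assumption on my side, and positional encoding is again omitted.

```python
import torch
import torch.nn as nn

n_features, d_model, horizon = 8, 64, 5
in_proj = nn.Linear(n_features, d_model)
out_proj = nn.Linear(d_model, n_features)
model = nn.Transformer(d_model=d_model, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)

src = torch.randn(32, 5, n_features)        # known steps [1..5]
tgt = torch.randn(32, horizon, n_features)  # ground-truth steps [6..10]

# --- training: teacher forcing ---
# The decoder sees [5, 6, 7, 8, 9] (targets shifted right, seeded with the
# last source step) and must predict [6, 7, 8, 9, 10] under a causal mask.
dec_in = torch.cat([src[:, -1:], tgt[:, :-1]], dim=1)
tgt_mask = model.generate_square_subsequent_mask(dec_in.size(1))
pred = out_proj(model(in_proj(src), in_proj(dec_in), tgt_mask=tgt_mask))
loss = nn.functional.mse_loss(pred, tgt)

# --- inference: no targets exist, so decode one step at a time ---
with torch.no_grad():
    memory = model.encoder(in_proj(src))    # encode the source once
    dec_in = src[:, -1:]                    # seed with the last known step
    for _ in range(horizon):
        tgt_mask = model.generate_square_subsequent_mask(dec_in.size(1))
        out = model.decoder(in_proj(dec_in), memory, tgt_mask=tgt_mask)
        next_step = out_proj(out[:, -1:])   # keep only the newest prediction
        dec_in = torch.cat([dec_in, next_step], dim=1)
    forecast = dec_in[:, 1:]                # predicted steps [6..10]
```

Is this shifted-targets / autoregressive-loop approach the right way to do it, or is there a better-established pattern for continuous time series?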