I have been trying to implement a vanilla transformer for time series forecasting. In the standard transformer, the usual approach seems to be to feed the decoder a right-shifted version of the target sequence. For time series, however, it seems to me that it would make sense for the decoder input to be the last element of the encoder input followed by the forecasting horizon. For example:
Encoder Input: 1, 2, 3, 4
Decoder Input: 4, 5, 6, 7, 8, 9, 10, …
However, I'm not sure if this is the way to do it, or if I would be better off making the decoder input 2, 3, 4, 5, 6, … (i.e. right shifted).
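To make the two options concrete, here is a small sketch of how I would build each decoder input from a toy series (the slicing and variable names are just mine):

```python
import torch

series = torch.arange(1.0, 11.0)  # toy series: 1, 2, ..., 10
enc_len, horizon = 4, 6

encoder_input = series[:enc_len]  # 1, 2, 3, 4

# Option A: decoder input starts at the last encoder value and runs
# over the forecast horizon; with teacher forcing the targets would be
# the same window shifted one step ahead.
decoder_input_a = series[enc_len - 1 : enc_len - 1 + horizon]  # 4, 5, 6, 7, 8, 9
targets_a = series[enc_len : enc_len + horizon]                # 5, 6, 7, 8, 9, 10

# Option B: the right-shifted alternative I describe above, which
# overlaps the encoder input.
decoder_input_b = series[1 : 1 + horizon]  # 2, 3, 4, 5, 6, 7
```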
I guess my question comes from a place of ignorance about how the decoder actually works. If the encoder-decoder attention is masked, then it makes sense to me to right shift, since the decoder would be using just the previous part of the series to predict the next step, and would only have access to limited encoder information.
However, if the decoder has access to the entire encoder output, it makes sense to me to do it the first way. In that case, the encoder-decoder attention mechanism has access to the entire previous series as it attempts to predict from time t+1 (the step after the end of the encoder sequence) to time t+p (p being the prediction window); a sketch of how I am wiring this up is below. I apologize if this question is unclear; if it is, let me know and I will try to edit it.
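For reference, here is roughly how I am setting this up with PyTorch's nn.Transformer. The shapes are placeholders (a real model would embed the series and add positional encodings), so treat it as a sketch of my setup rather than working forecasting code:

```python
import torch
import torch.nn as nn

d_model, enc_len, horizon = 16, 4, 6
model = nn.Transformer(d_model=d_model, nhead=4, batch_first=True)

src = torch.randn(1, enc_len, d_model)  # embedded encoder input
tgt = torch.randn(1, horizon, d_model)  # embedded decoder input (option A or B)

# tgt_mask applies to the decoder self-attention, so decoder position i
# can only attend to decoder positions <= i. I am not passing a
# memory_mask, so the encoder-decoder (cross) attention sees the full
# encoder output.
tgt_mask = nn.Transformer.generate_square_subsequent_mask(horizon)

out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([1, 6, 16])
```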