I have two sequences of unequal length, one for the input and one for the target. This is not NLP data but multivariate time series data: the task is to predict the values of the next M days (regression) from the features of the past N days.
In this case, I wonder how to handle the data at inference time with nn.Transformer or nn.TransformerDecoderLayer.
It doesn't have any special token like <sos>, so I have no idea how to prepare the tgt when calling the decoder's forward().
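For context, here is a minimal sketch of the setup I have in mind (all names and dimensions below are made up for illustration); the crux is the tgt seeding at inference, where one common workaround for the missing start token is to seed the decoder with the last known input step and generate autoregressively:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: N past days of F features -> predict M future days.
N, M, F, D = 30, 7, 8, 64  # D = model (embedding) size

model = nn.Transformer(d_model=D, batch_first=True)
proj_in = nn.Linear(F, D)   # project raw features into the model dimension
proj_out = nn.Linear(D, F)  # project decoder output back to feature space

src_raw = torch.randn(1, N, F)  # past window of the time series
src = proj_in(src_raw)          # (1, N, D)

# Seed the decoder with the last known input step instead of an <sos> token,
# then feed each prediction back in, one future day at a time.
tgt = src[:, -1:, :]                      # (1, 1, D)
for _ in range(M):
    out = model(src, tgt)                 # (1, len(tgt), D)
    next_step = out[:, -1:, :]            # take the newest predicted step
    tgt = torch.cat([tgt, next_step], 1)  # append it to the decoder input

preds = proj_out(tgt[:, 1:, :])           # (1, M, F) predicted future days
```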
So when the encoder output and the decoder input meet in cross-attention, the Keys and Values come from the encoder and the Queries from the decoder.
QK^T gives you a (Num_of_tokens_2 x Num_of_tokens) matrix, and then
softmax(QK^T) V gives you a (Num_of_tokens_2 x Embed_size) matrix, which is the shape of the decoder's output after all the layers (Num_of_tokens is the encoder sequence length, Num_of_tokens_2 the decoder's).
Note: all the layers preserve the shape of their input.
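The shape arithmetic above can be checked directly with toy tensors (the dimensions here are arbitrary, chosen just to make the two sequence lengths distinguishable):

```python
import torch

num_tokens = 10    # encoder sequence length (Keys / Values)
num_tokens_2 = 4   # decoder sequence length (Queries)
embed_size = 64

Q = torch.randn(num_tokens_2, embed_size)  # from the decoder
K = torch.randn(num_tokens, embed_size)    # from the encoder
V = torch.randn(num_tokens, embed_size)    # from the encoder

scores = Q @ K.T                                      # (num_tokens_2, num_tokens)
attn = torch.softmax(scores / embed_size ** 0.5, -1)  # rows sum to 1
out = attn @ V                                        # (num_tokens_2, embed_size)

print(scores.shape, out.shape)
```

Note that the decoder's output length always matches its input (query) length, regardless of how long the encoder sequence is.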
I would also highly recommend coding the Transformer from scratch; it's not that complicated, and it will give you a much better understanding of the input/output shapes.
Here's a good reference (but pay attention and check that it matches the paper; there may be a few inconsistencies with the original model, or oversimplifications)