Hi, I am trying to train a CVAE with a Transformer encoder. The encoder output has shape [Batch size, Sequence length, Embedding size], but the recognition model expects inputs of shape [Batch size, Embedding size].
I came up with two ideas to solve this problem. One is to use a linear layer: I reshape the encoder output to [Batch size, Sequence length * Embedding size] and feed it to the linear layer. The other is to use a 1-layer RNN and take its last hidden state, whose shape is [Batch size, Embedding size].
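For concreteness, here is a minimal PyTorch sketch of both ideas. The sizes are placeholders, and I use a GRU as one concrete choice of 1-layer RNN:

```python
import torch
import torch.nn as nn

# Placeholder sizes for illustration
batch_size, seq_len, emb_size = 4, 16, 32
enc_out = torch.randn(batch_size, seq_len, emb_size)  # Transformer encoder output

# Idea 1: flatten the sequence dimension, then project back to emb_size
flatten_proj = nn.Linear(seq_len * emb_size, emb_size)
pooled_linear = flatten_proj(enc_out.reshape(batch_size, seq_len * emb_size))
print(pooled_linear.shape)  # [Batch size, Embedding size]

# Idea 2: run a 1-layer RNN (here a GRU) and keep the last hidden state
rnn = nn.GRU(emb_size, emb_size, num_layers=1, batch_first=True)
_, h_n = rnn(enc_out)      # h_n has shape [num_layers, Batch size, Embedding size]
pooled_rnn = h_n[-1]       # last layer's hidden state: [Batch size, Embedding size]
print(pooled_rnn.shape)
```

Note that the flatten-and-project version ties the model to a fixed sequence length, while the RNN version works for variable-length sequences.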
I don’t know which one is better in theory. Are there other solutions?
Thank you.