How to convert 2D Transformer Encoder output to 1D matrix?

Hi, I am trying to train a CVAE with a Transformer encoder. The shape of the encoder output is [Batch size, Sequence length, Embedding size], but the recognition model expects an input of shape [Batch size, Embedding size].

I came up with two ideas to solve this problem. One is to use a linear layer: I reshape the encoder output to [Batch size, Sequence length * Embedding size] and feed that to the linear layer. The other is to use a 1-layer RNN and take its last hidden state, whose shape is [Batch size, Embedding size].
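To make the two ideas concrete, here is a minimal PyTorch sketch of both; the tensor sizes are made-up placeholders, not values from my actual model:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only
batch, seq_len, emb = 8, 16, 32
enc_out = torch.randn(batch, seq_len, emb)  # stands in for the Transformer encoder output

# Idea 1: flatten the sequence dimension, then project with a linear layer
linear = nn.Linear(seq_len * emb, emb)
z1 = linear(enc_out.reshape(batch, seq_len * emb))  # [batch, emb]

# Idea 2: run a 1-layer RNN (GRU here) over the sequence, keep the last hidden state
rnn = nn.GRU(emb, emb, num_layers=1, batch_first=True)
_, h_n = rnn(enc_out)  # h_n: [num_layers=1, batch, emb]
z2 = h_n.squeeze(0)    # [batch, emb]
```

Note that the flattening approach ties the linear layer to a fixed sequence length, while the RNN approach works for variable-length sequences.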

I don’t know which one is better in theory. Are there other solutions I should consider?

Thank you.