I want to learn a fixed-size representation from a variable-length sequence of vectors.

Until now, I used a bidirectional LSTM for this purpose. However, as my sequences are rather long (up to 2000 vectors), I now want to try a Transformer encoder instead.

With the LSTM, I used the final hidden state of the LSTM cell as the representation of the whole sequence. Unfortunately, this method is not suitable for the Transformer encoder, as it outputs one vector for each element of my input sequence.

In this post here, it is described that BERT inserts a special [CLS] token into the sequence and uses the output representation of this token as the representation of the whole sequence.

However, in the post they use the BERT tokenizer to get that token, which is not suitable for me as my raw sequence does not represent words (it is just a sequence of vectors).
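
To make the question concrete, here is a rough sketch of what I imagine this could look like for plain vectors. It is only my guess, not something taken from the post: I assume the [CLS] "token" is simply a trainable vector of the same dimension as my inputs, prepended to every sequence. The class name `ClsPoolingEncoder`, the sizes, and the initialization scale are all made up for illustration, and positional encoding is left out to keep it short.

```python
import torch
import torch.nn as nn


class ClsPoolingEncoder(nn.Module):
    """Transformer encoder that summarizes a sequence via a learned [CLS] vector."""

    def __init__(self, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        # Assumption: [CLS] is just a trainable vector living in the input space.
        self.cls = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x, pad_mask=None):
        # x: (batch, seq_len, d_model); pad_mask: (batch, seq_len), True = padding
        batch = x.size(0)
        cls = self.cls.expand(batch, -1, -1)
        x = torch.cat([cls, x], dim=1)  # prepend [CLS] at position 0
        if pad_mask is not None:
            # The [CLS] position is never padding, so it gets a False entry.
            keep = torch.zeros(batch, 1, dtype=torch.bool, device=pad_mask.device)
            pad_mask = torch.cat([keep, pad_mask], dim=1)
        out = self.encoder(x, src_key_padding_mask=pad_mask)
        return out[:, 0]  # output at the [CLS] position, one vector per sequence


model = ClsPoolingEncoder()
x = torch.randn(3, 10, 64)                       # 3 sequences, padded to length 10
pad_mask = torch.zeros(3, 10, dtype=torch.bool)
pad_mask[1, 7:] = True                           # second sequence has only 7 real vectors
rep = model(x, pad_mask)                         # shape: (3, 64)
```

Whether a learned vector like this is the right choice, or whether the [CLS] input should be something else (zeros, a fixed vector, ...), is exactly what I am unsure about.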

Can anybody tell me what the special [CLS] token should look like in this case?

Thanks for your help!