Transformers to encode a sequence into a fixed-length vector

In an RNN you can use the last hidden state as a vector representation of a sequence. I was wondering if there is any similar idea in the transformer architecture?

Background:
I am building a model that deals with protein data. At one stage I wish to include raw protein sequence information in the model. For this I need to encode a sequence into a fixed-length vector that captures information about the sequence as a whole. This vector will then be concatenated with some other vector and used in downstream processing.

You can define your own pooling (e.g. average-pool all token vectors) or simply use the BERT default setup, the [CLS] token’s output vector.
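For illustration, here is a minimal sketch of both options using the Hugging Face `transformers` API. The checkpoint name and input text are placeholders; for protein data you would swap in a protein language model's tokenizer and weights.

```python
# Minimal sketch: two ways to get a fixed-length vector from a BERT-style encoder.
# The model name and input string are placeholders, not a recommendation.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("some example sequence", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden = outputs.last_hidden_state             # (batch, seq_len, hidden_size)

# Option 1: the [CLS] token's output vector (first position).
cls_vector = hidden[:, 0, :]                   # (batch, hidden_size)

# Option 2: mean pooling over real tokens, ignoring padding.
mask = inputs["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)
mean_vector = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```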


Ok, interesting. So if I understand correctly, the model will learn to generate this artificial vector ([CLS]) in such a way that it is useful for solving the task at hand? In the linked post this would be text classification.

The [CLS] output can be treated like the last hidden state of an LSTM.

People always say “look at BERT”, but what if one wants to build one’s own sequence-to-vector encoder? Most tutorials on BERT are limited to uses within machine translation (which is a sequence-to-sequence task), and they spend an enormous amount of time just setting up the dataset, the tokenizer, the mask, and a single example configuration of BERT. As a guide this is entirely inadequate for studying sequence-to-vector encodings of the type ErikJ has written about. Even a simple sequence-to-sequence example that just covers the basic APIs and how to investigate various configurations would be useful. In other words, the tutorials need to drop all the data-preparation talk and just work on artificial tensor sequences. It is not hard to build artificial sequences for experimentation and for showing the use of the transformer APIs, and not all examples need to be about text sequences.
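In that spirit, a toy sequence-to-vector encoder built directly on artificial tensors with PyTorch's `nn.TransformerEncoder` could look like the sketch below. All sizes are arbitrary, and mean pooling is just one possible way to collapse the sequence.

```python
# Toy sequence-to-vector encoder on artificial tensors: no tokenizer, no dataset.
import torch
import torch.nn as nn

batch, seq_len, d_model = 4, 50, 64

encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=8, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

x = torch.randn(batch, seq_len, d_model)    # artificial "token" embeddings
h = encoder(x)                              # (batch, seq_len, d_model)

# Collapse the sequence dimension to get one vector per sequence,
# here by mean pooling (a [CLS]-style token would also work).
vec = h.mean(dim=1)                         # (batch, d_model)
print(vec.shape)                            # torch.Size([4, 64])
```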


The suggestion here, which has many of the same issues as the task described in this thread, implies that one may first have to reduce one's sequence lengths, perhaps through convolutions with a large stride, before feeding the resulting (shorter) sequences to the transformer layers.

One or more convolution layers before the transformer layers (in problems such as the one mentioned, where sequences are long and we are looking for a full-“sentence” encoding into a single vector) can also improve the overall positional encoding of the full model.
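A sketch of that idea, assuming a PyTorch model with a strided `Conv1d` front-end followed by transformer encoder layers and mean pooling; channel counts, kernel size, and stride are illustrative only (e.g. 21 input channels for a one-hot amino-acid encoding).

```python
# Sketch: shorten long sequences with a strided Conv1d before the transformer
# layers, then pool to a single fixed-length vector.
import torch
import torch.nn as nn

class ConvTransformerEncoder(nn.Module):
    def __init__(self, in_channels=21, d_model=128, stride=4, nhead=8, num_layers=2):
        super().__init__()
        # Large-stride convolution shortens the sequence by roughly `stride`
        # and injects some local positional information.
        self.conv = nn.Conv1d(in_channels, d_model, kernel_size=stride * 2,
                              stride=stride, padding=stride)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):                    # x: (batch, in_channels, seq_len)
        h = self.conv(x)                     # (batch, d_model, ~seq_len // stride)
        h = h.transpose(1, 2)                # (batch, shorter_len, d_model)
        h = self.encoder(h)
        return h.mean(dim=1)                 # (batch, d_model) fixed-length vector

x = torch.randn(2, 21, 1000)                 # artificial protein-like input
vec = ConvTransformerEncoder()(x)
print(vec.shape)                             # torch.Size([2, 128])
```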


Thank you so much, this will be most helpful!