In an RNN you can use the final hidden state as a vector representation of a sequence. I was wondering whether there is a similar idea in the transformer architecture?
Background:
I am building a model that deals with protein data. At one stage I wish to include raw protein sequence information in the model. For this I need to encode a sequence into a fixed-length vector that captures information about the sequence at a global level. This vector will then be concatenated with some other vector and used in downstream processing.
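One common way to get such a fixed-length vector is to run the sequence through a transformer encoder and pool the per-position outputs. Here is a minimal sketch (my own illustration, not from the original post) using PyTorch; all dimensions and the downstream concatenation are assumptions for the example:

```python
import torch
import torch.nn as nn

d_model = 128
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)

# toy batch: 2 protein sequences of 300 residues, already embedded to d_model
x = torch.randn(2, 300, d_model)
h = encoder(x)                       # (2, 300, d_model) per-position outputs
seq_vec = h.mean(dim=1)              # (2, d_model) fixed-length summary via mean pooling

other_features = torch.randn(2, 32)  # some other per-protein vector
combined = torch.cat([seq_vec, other_features], dim=-1)  # passed to downstream layers
```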
OK, interesting. So if I understand correctly, the model will learn to generate this artificial vector ([CLS]) in such a way that it is useful for solving the task at hand? In the linked post this would be text classification.
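Exactly. A sketch of that idea (an assumed implementation, not the code from the linked post): prepend a learned [CLS] token to the input and read off its final hidden state as the sequence vector, so the training objective forces that position to summarise the whole sequence.

```python
import torch
import torch.nn as nn

class ClsEncoder(nn.Module):
    def __init__(self, d_model=128, nhead=8, num_layers=4):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, d_model))  # learned [CLS] embedding
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        cls = self.cls.expand(x.size(0), -1, -1)
        h = self.encoder(torch.cat([cls, x], dim=1))
        return h[:, 0]                          # hidden state at the [CLS] position

vec = ClsEncoder()(torch.randn(2, 300, 128))    # (2, 128)
```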
People always say “look at BERT”, but what if one wants to build one’s own sequence-to-vector encoder? Most tutorials on BERT are limited to uses within machine translation (which is a sequence-to-sequence task), and they spend an enormous amount of time on just setting up the dataset, the tokenizer, the mask, and a single example configuration of BERT... As a guide this is entirely inadequate when it comes to studying sequence-to-vector encodings of the kind ErikJ has written about... Even a simple sequence-to-sequence example that just covers the basic APIs and how to investigate various configurations would be useful... In other words, the tutorials need to drop all the data-preparation talk and just work on artificial tensor sequences... It is not hard to build artificial sequences for experimentation and for showing the use of the transformer APIs... Not all examples need to be about text sequences...
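For what it is worth, the bare-bones example asked for here can be quite short. This sketch feeds artificial tensors through the core PyTorch transformer encoder API, with no tokenizer or dataset; the sizes and padding mask are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

x = torch.randn(8, 50, 64)                    # 8 artificial sequences of length 50
pad = torch.zeros(8, 50, dtype=torch.bool)    # True would mark padding positions
out = encoder(x, src_key_padding_mask=pad)    # (8, 50, 64)
seq_vec = out.mean(dim=1)                     # one fixed-length vector per sequence
```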
which has many of the same issues as the task suggested here, implies that one may first have to reduce one’s sequence lengths – perhaps through convolutions with a large stride – before feeding the resulting (shorter) sequences to the transformer layers.
One or more convolution layers before the transformer layers (in problems such as the one mentioned, where sequences are long and where we want to encode the full “sentence” into a single vector) can also improve the overall positional encoding of the full model.
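A hedged sketch of that suggestion (my own illustration; kernel size, stride, and dimensions are assumptions): a strided 1-D convolution shortens the sequence and mixes in local positional information before the transformer encoder, and the result is pooled into a fixed-length vector.

```python
import torch
import torch.nn as nn

class ConvThenTransformer(nn.Module):
    def __init__(self, in_dim=20, d_model=128, stride=4):
        super().__init__()
        # strided conv reduces sequence length by roughly a factor of `stride`
        self.conv = nn.Conv1d(in_dim, d_model, kernel_size=7, stride=stride, padding=3)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, x):                  # x: (batch, seq_len, in_dim), e.g. one-hot residues
        h = self.conv(x.transpose(1, 2))   # (batch, d_model, seq_len / stride)
        h = self.encoder(h.transpose(1, 2))
        return h.mean(dim=1)               # fixed-length vector per sequence

vec = ConvThenTransformer()(torch.randn(2, 1200, 20))  # (2, 128)
```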