Hey, guys. I am doing a sequence classification task using
nn.TransformerEncoder(). Whose pipeline is similar to
I have tried several temporal features fusion methods:
Selecting the final outputs as the representation of the whole sequence.
Using an affine transformation to fuse these features.
Classifying the sequence frame by frame, and then select the max values to be the category of the whole sequence.
But, all these 3 methods got a terrible accuracy, only 25% for 4 categories classification. While using
nn.LSTM with the last hidden state, I can achieve 83% accuracy easily. I tried plenty of hyperparameters of
nn.TransformerEncoder(), but without any improvement for the accuracy.
Could you guys give me some practical advice? Thanks
I think that BERT uses the special
[CLS] token to get the entire sentence representation of the sequence in one token. Perhaps you can try something similar and then take this vector as your input. For BERT, they generally put this token in the first position.
From my experience, for classification tasks, usually the concatenation of last few encoder layers and then operating on this (using RNN or Linear) tend to give best results. You can also average or maxpool to reduce dimensions.
If you’re planning to use a pre-trained model like BERT, the output representation of
[CLS] token is your best bet. For some reason
[CLS] captures the meaning of the entire sequence (no theoretical backing to this).
The idea is that every token interacts with every other token. So if you have a token that captures the meaning of the entire sequence without influencing or having meaning within the sequence itself.