Behavior of Decoder Transformers

Hi,

I’m using a pretrained mT5 model for a semantic parsing task. I don’t understand how the model can predict unseen tokens.

When I implemented an RNN-based seq2seq model, it could not predict unseen tokens (even though those tokens appear in the input, so the model only has to learn the pattern and copy them to the output). The final layer is a classification over all tokens in the vocabulary, so the model is biased toward the token classes it has seen during training and cannot predict unseen ones.

However, I tried the same task with a pretrained transformer, and it worked. I don’t understand this point. The last layer is still a classification layer, right? So why can the model predict a class it has never seen before?

Thank you

In transformer training for next-token prediction, generally speaking, a triangular (causal) mask is applied to the input, so every token in the sequence except the first serves as a training target.

For instance, if we fed a model “The fox jumped over the lazy dog.”, it would be trained on pairs like:

Input: The _ _ _ _ _ _. Target: fox
Input: The fox _ _ _ _ _. Target: jumped

And so on.
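The shifting described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not mT5’s actual training code: each position is trained to predict the next token while the lower-triangular mask hides everything to its right.

```python
tokens = ["The", "fox", "jumped", "over", "the", "lazy", "dog", "."]

# Every position t is trained to predict token t+1, conditioned only on
# tokens 0..t. This yields one (context, target) pair per position.
pairs = [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]

for ctx, target in pairs[:2]:
    print(ctx, "->", target)

# In a real transformer, a lower-triangular attention mask enforces all of
# these conditionings in a single batched forward pass:
n = len(tokens)
mask = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
```

Row i of `mask` has ones only in columns 0..i, which is exactly the “can only attend to the past” constraint the pairs above spell out one by one.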

Hi, I understand how the training works.

But in my downstream task, it even copies meaningless names from the input. For example:

I do an NER task, just detecting entities without classifying their type:

A conference abc was held on 12-9-2000
→ model’s output: conference abc | 12-9-2000

Here abc is a meaningless token, unseen in the downstream task data, yet the model somehow learned to predict it through the classification layer? The amazing thing is that the class for the token abc was never trained.

I also trained an RNN sequence-to-sequence model with attention, and it is not capable of doing that. My thought is that the embedding vectors of tokens unseen during training (but present in the test set) were never updated, unlike in a pretrained model. So the model treats every token it hasn’t learned the same way (PyTorch initializes embedding weights from a normal distribution, as I remember), and therefore it doesn’t know how to copy them.
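That intuition can be checked with a small sketch. PyTorch’s `nn.Embedding` does default to N(0, 1) initialization; here I mimic that with NumPy (the vocabulary size, dimension, and token ids are arbitrary choices for illustration) to show that two never-updated embedding rows carry nothing but noise about their tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 64

# Mimic nn.Embedding's default init: every row drawn i.i.d. from N(0, 1).
emb = rng.standard_normal((vocab_size, dim))

# Pick two "unseen" tokens. If neither row was ever touched by a gradient
# update, nothing distinguishes them except random noise: their cosine
# similarity is close to 0, and so is their similarity to any other row.
a, b = emb[123], emb[456]
cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(cos)
```

In a pretrained model every row has been shaped by the pretraining corpus, so even a rare token has a distinctive, informative vector that attention can latch onto and copy; with freshly initialized rows the model has no signal to tell unseen tokens apart.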