What is so special about special tokens?

I am not sure if anyone can help answer this here, but I cannot seem to find an answer anywhere:
what exactly is the difference between a “token” and a “special token”?

I understand the following:

  • what a typical token is
  • what a typical special token is: MASK, UNK, SEP, etc.
  • when you add a token (when you want to expand your vocab)

What I don’t understand is: in what circumstances would you want to create a new special token? Are there any examples of what we need one for, beyond the default special tokens? If an example uses a special token, why can’t a normal token achieve the same objective?

tokenizer.add_tokens(['[EOT]'], special_tokens=True)

I also don’t quite understand the following description in the source documentation.
What difference does it make to our model if we set add_special_tokens to False?

add_special_tokens (bool, optional, defaults to True) — Whether or not to encode the sequences with the special tokens relative to their model.
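
For concreteness, this is the kind of call I mean (a minimal sketch, assuming a bert-base-uncased tokenizer; the checkpoint is just for illustration):

from transformers import AutoTokenizer

# hypothetical setup, only to make the question concrete
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

with_special = tokenizer("i am here", add_special_tokens=True)["input_ids"]
without_special = tokenizer("i am here", add_special_tokens=False)["input_ids"]

print(tokenizer.convert_ids_to_tokens(with_special))
# presumably something like ['[CLS]', 'i', 'am', 'here', '[SEP]']
print(tokenizer.convert_ids_to_tokens(without_special))
# presumably ['i', 'am', 'here'], i.e. only the model-specific wrapping differs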

As per the docs here

It means that if you have added the token EOT, then suppose your string contains “eot i am here”: eot will be treated differently if it is a special token.

For example, this is roughly how it looks.
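
A minimal sketch of the idea, assuming a bert-base-uncased tokenizer (the exact checkpoint doesn’t matter much here):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
tokenizer.add_tokens(['EOT'], special_tokens=True)

print(tokenizer.tokenize("EOT i am here"))
# e.g. ['EOT', 'i', 'am', 'here'] -> the special token survives as one piece
print(tokenizer.tokenize("eot i am here"))
# e.g. ['eo', '##t', 'i', 'am', 'here'] -> lowercase eot is treated as ordinary text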

And about the first question, I don’t have much of an idea, as I have never used the add_tokens functionality (the work I do is mostly NLP, where the default special tokens are enough).

Interesting. In the above example, the word eot got split into two tokens despite EOT being a special token. Does that mean special tokens are case-sensitive?

Yeah. For special tokens there is no processing (like lowercasing, etc.), so eot will be split but EOT will not.
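
As I understand it, the special token is matched before the tokenizer’s normalization (lowercasing) runs, so only the exact string EOT is protected; everything else still goes through the usual pipeline. A quick sketch, again assuming bert-base-uncased:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
tokenizer.add_tokens(['EOT'], special_tokens=True)

print(tokenizer.tokenize("HELLO EOT"))
# e.g. ['hello', 'EOT'] -> the normal word is lowercased, the special token is left untouched
print(tokenizer.tokenize("hello eot"))
# e.g. ['hello', 'eo', '##t'] -> lowercase eot does not match the special token and gets split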