What is so special about special tokens?

I'm not sure if anyone here can help, but I cannot seem to find an answer anywhere:
what exactly is the difference between a “token” and a “special token”?

I understand the following:

  • what a typical token is
  • what a typical special token is: [MASK], [UNK], [SEP], etc.
  • when you add a token (when you want to expand your vocab)

What I don’t understand is: under what circumstances would you want to create a new special token? What would we need one for, and when would we want to create a special token other than the default ones? If an example uses a special token, why can’t a normal token achieve the same objective?

tokenizer.add_tokens(['[EOT]'], special_tokens=True)

I also don’t quite understand the following description in the documentation.
What difference does it make to our model if we set add_special_tokens to False?

add_special_tokens (bool, optional, defaults to True) — Whether or not to encode the sequences with the special tokens relative to their model.

As per the docs here
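From a quick experiment (a sketch assuming transformers is installed and bert-base-uncased is used; any pretrained tokenizer behaves similarly), the flag seems to control whether the model-specific wrapper tokens like [CLS] and [SEP] are added around the sequence:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Default (add_special_tokens=True): the sequence is wrapped
# in the model's special tokens, here [CLS] ... [SEP].
with_specials = tok.encode("i am here")

# add_special_tokens=False: only the word-piece ids, no wrapper.
without_specials = tok.encode("i am here", add_special_tokens=False)

print(tok.convert_ids_to_tokens(with_specials))     # starts with [CLS], ends with [SEP]
print(tok.convert_ids_to_tokens(without_specials))  # just the plain word pieces
```

So with the flag set to False, the model never sees the [CLS]/[SEP] markers it was pretrained with, which matters if you feed the ids straight into a BERT-style model.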


It means that if you have added the token [EOT], and your string contains “eot i am here”, then eot will be treated differently depending on whether it is a special token.

For example, this is how it looks:
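Roughly like this (a sketch assuming bert-base-uncased; the exact word pieces for the plain text depend on the vocab):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
tok.add_tokens(["[EOT]"], special_tokens=True)

# The added special token is matched verbatim and kept as one token.
print(tok.tokenize("[EOT] i am here"))   # ['[EOT]', 'i', 'am', 'here']

# Plain text that merely resembles it goes through normal word-piece splitting.
print(tok.tokenize("eot i am here"))
```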

And about the first question, I don’t have much of an idea, as I have never used the add_tokens functionality (my work is mostly in NLP, where the default special tokens are enough).

Interesting. In the above example, the word eot got split into two tokens despite EOT being a special token. Does that mean that special tokens are case-sensitive?

Yeah. For special tokens there is no preprocessing (like lowercasing), so eot will be split but EOT will not be.
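That matches this mental model (a toy sketch, not the real implementation): special tokens are split out verbatim first, case-sensitively, and only the remaining text goes through normalization such as lowercasing:

```python
def toy_tokenize(text, special_tokens):
    # Pass 1: split out special tokens verbatim -- case-sensitive, no normalization.
    pieces = [text]
    for sp in special_tokens:
        next_pieces = []
        for piece in pieces:
            if piece in special_tokens:
                next_pieces.append(piece)  # already extracted, leave untouched
                continue
            parts = piece.split(sp)
            for i, part in enumerate(parts):
                if part:
                    next_pieces.append(part)
                if i < len(parts) - 1:
                    next_pieces.append(sp)
        pieces = next_pieces
    # Pass 2: normalize (lowercase) and whitespace-split everything else.
    out = []
    for piece in pieces:
        if piece in special_tokens:
            out.append(piece)              # special tokens skip normalization
        else:
            out.extend(piece.lower().split())
    return out

print(toy_tokenize("[EOT] I am here", {"[EOT]"}))  # ['[EOT]', 'i', 'am', 'here']
# "eot" never matches the special token "[EOT]", so it is just normal text here
# (a real word-piece tokenizer would go on to split it into subwords):
print(toy_tokenize("eot I am here", {"[EOT]"}))    # ['eot', 'i', 'am', 'here']
```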