It would be unusual to tokenize entire sentences. Usually, tokens represent words, n-grams, characters, or subword pieces (e.g., SentencePiece). Examples (a quick code sketch follows them):
Words: [The, brown, fox, jumps, over, the, lazy, dog, .]
SentencePiece (subword): [th, e, br, own, f, ox, j, ump, s, o, ver, th, e, la, zy, d, og, .]
The exact pieces vary with the training text and are determined algorithmically.
Characters: [t, h, e, _, b, …]
N-grams: [The, brown fox, jumps over, the, lazy dog, .]
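To make those granularities concrete, here is a minimal sketch in plain Python. The SentencePiece-style pieces are left out because they require a trained model (see the SentencePiece sketch further down); everything here is illustrative, not a full tokenizer.

```python
sentence = "The brown fox jumps over the lazy dog ."

# Word tokens: a naive whitespace split (real word tokenizers also split punctuation).
words = sentence.split()

# Character tokens: every character, with "_" standing in for the space.
chars = [c if c != " " else "_" for c in sentence]

# Word bigrams: one simple flavor of n-gram, built by pairing adjacent words.
bigrams = [" ".join(pair) for pair in zip(words, words[1:])]

print(words)      # [The, brown, fox, ...]
print(chars[:5])  # [T, h, e, _, b]
print(bigrams)    # [The brown, brown fox, ...]
```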
With that said, I prefer using words or n-grams.
A lot of people make use of pre-trained embeddings/tokenizers. If you go this route, though, be careful which one you use, because most are excessively large and uncurated. For example, the pretrained GloVe vectors (from Stanford) include website links and spam gibberish in their vocabulary.
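If you do use pre-trained vectors, a little curation while loading goes a long way. Below is one possible sketch that filters junk entries out of a GloVe text file; the file path and the "plain lowercase word" rule are assumptions you would adjust for your own download and needs.

```python
import re

# Path and filtering rule are assumptions; adjust for your GloVe download.
GLOVE_PATH = "glove.840B.300d.txt"
word_re = re.compile(r"^[a-z]+$")  # keep plain lowercase alphabetic tokens only

embeddings = {}
with open(GLOVE_PATH, encoding="utf-8") as f:
    for line in f:
        token, *values = line.rstrip().split(" ")
        # Skip URLs, spam gibberish, and anything that isn't a plain word.
        if not word_re.match(token):
            continue
        embeddings[token] = [float(v) for v in values]

print(f"kept {len(embeddings)} curated vectors")
```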
For SentencePiece, I use generate_sp_model() from torchtext (torchtext.data.functional). That creates two files; one is a .vocab file that holds all of the vocab and can be opened and read with UTF-8 encoding. Then I use that with …
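Here is a minimal sketch of that workflow, assuming torchtext's SentencePiece wrappers and a placeholder corpus file. The final step (load_sp_model / sentencepiece_tokenizer) is my guess at how the pipeline continues, not something stated above.

```python
from torchtext.data.functional import (
    generate_sp_model,
    load_sp_model,
    sentencepiece_tokenizer,
)

# Train a SentencePiece model on a plain-text corpus (one example per line).
# "corpus.txt" and the prefix "spm_user" are placeholders.
generate_sp_model("corpus.txt", vocab_size=20000, model_prefix="spm_user")

# The call above writes spm_user.model and spm_user.vocab.
# The .vocab file is plain UTF-8 text: one "<piece>\t<score>" entry per line.
with open("spm_user.vocab", encoding="utf-8") as f:
    vocab = [line.split("\t")[0] for line in f]
print(vocab[:10])

# One possible continuation of the pipeline: tokenize raw text with the model.
sp_model = load_sp_model("spm_user.model")
tokenize = sentencepiece_tokenizer(sp_model)
print(list(tokenize(["The brown fox jumps over the lazy dog."])))
```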
With word tokens, a vocabulary of around 20,000 words will usually cover about 99.5% of the token occurrences in a corpus, punctuation included.
With n-grams, you can get more coverage with a smaller vocabulary.
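As a rough way to check that kind of coverage figure on your own data, here is a small sketch; the corpus path is a placeholder and the whitespace split stands in for whatever tokenizer you actually use.

```python
from collections import Counter

def coverage(tokens, vocab_size=20_000):
    """Fraction of all token occurrences covered by the top-N most frequent tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = sum(n for _, n in counts.most_common(vocab_size))
    return covered / total

# tokens = open("corpus.txt", encoding="utf-8").read().lower().split()  # placeholder corpus
# print(f"top-20k coverage: {coverage(tokens):.1%}")
```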