Need some good resources to learn how to make custom vocab

IneedMrmeeseeks · February 6, 2023, 7:40am

Hi, torch friends.
I am a novice in this nlp things. I want to start with making my own vocab using my data.
The problem is about OOV(out of vocabulary). I really hard to find an answer.
To concrete my question, I use an example. (I may be wrong)

id	tweets
0	sentence 1
1	sentence 2
2	sentence 3

Let’s say I’ve created a vocab with only sentence1 and 2.
what if there are new words in sentence 3?
Then, How can I update my prior vocab?

I would also appreciate it if let me know what tokenizer is best for a torch user and links for studying. (torchtext, gensim, nltk etc.)

J_Johnson · February 6, 2023, 8:14am

It would be unusual to tokenize entire sentences. Usually, tokens represent words, ngrams, characters or sentencepiece. Examples:

Words
[The, brown, fox, jumps, over, the lazy, dog, .]

Ngrams
[th, e, br, own, f, ox, j, ump, s, o, ver, th, e, la, zy, d, og, .]
Results may vary by text, and is determined algorithmically.

Chars
[t, h, e, _, b, …]

Sentencepiece
[The, brown fox, jumps over, the, lazy dog, .]

With that said, I prefer using words or ngrams.

A lot of people make use of pre-trained embeddings/tokenizers. However, if you go this route, just be careful which one you use because most are excessively large and uncurated. For example, GLoVe pretrained (put out by Google) has website links and spam gibberish as part of the encoding set.

I use generate_sp_model() from here:

https://pytorch.org/text/stable/data_functional.html

That creates two files, one which is a .vocab file and has all of the vocab in it, and can be opened and read with utf-8 encoding. Then I use that with build_vocab_from_iterator here:

https://pytorch.org/text/stable/vocab.html

With word tokens, 20,000 words will usually cover 99.5% of all vocab and punctuation.

With ngrams, you can get more coverage with less.

IneedMrmeeseeks · February 6, 2023, 8:18am

Yes, what I meant was words token in the sentences!!
Thx for the usecases and links!!

vdw · February 7, 2023, 2:18am

When creating a vocabulary from a corpus it’s quite common to add some “special” tokens to that vocabulary. Which special tokens are needed depends on the task, but for example Sequence-to-Sequence models need a special <SOS> (start of sequence) and <EOS> (end of sequence) token – it doesn’t matter what they are called, the tokens just have to be unique.

A very common special token would be something like <UNK> (unknown) to represent any word/token never seen before. This means that when you have a news sentence, and this sentence contains words not in your vocabulary, these unknown words get (internally) replaced by <UNK> which is in your vocabulary.

The problem is not that you cannot extend you vocabulary but a model will be trained on the initial vocabulary. And since, e.g., the embedding layer will depend on the size of the vocabulary, add new words would mean changing this layer.

IneedMrmeeseeks · February 7, 2023, 4:19am

Thx for the reply.
How about making the size of the vocab dynamic?
Like nn.Embedding(len(vocab), embedding_size)

Okay, Let’s say I encounter <UNK> tokens. Next time I want to include those <UNK> tokens as known words.
In other words, is there a way that I can add new words to the initial vocab?
Even if perhaps I have to train embedding all over again

Thx!

J_Johnson · February 7, 2023, 4:30am

The embedding layer is basically just a matrix which is num_tokens by num_embeddings. Suppose you give your embedding layer 100 embedding units, i.e. 100 dimensions on which each word can be assigned meaning(i.e. wet/dry, hot/cold, person/object, etc.). When you send in the number assigned for your word, let’s say the word “the” is 10, then the embedding layer just finds the 10th vector in the matrix and replaces the integer.

You can always append more vectors to your embedding layer. But your model will need to finetune on the new vocabulary.