How to detect OOV words for NLP models with subword tokenization

How can I detect OOV words for a model? For example, if I have the sentence "Kareem love Rawda but she can’t fly how to koberskobl"

and pass this sentence to the tokenizer, it will give me the following tokens:
['ka', '##ree', '##m', 'love', 'raw', '##da', 'but', 'she', 'cannot', 'fly', 'how', 'to', 'do', 'kobe', '##rsk', '##ob', '##l']
and then run the following code:

input_text = "Kareem love Rawda but she can't fly how to koberskobl"
oov_words = []
# tokenizer is the pretrained subword tokenizer used above
for word in input_text.split():
    # A word counts as OOV if the tokenizer maps it to the unknown-token id
    if tokenizer.convert_tokens_to_ids([word]) == [tokenizer.unk_token_id]:
        oov_words.append(word)

# Print the OOV words
print("Out-of-vocabulary words:", oov_words)
It will give me: Out-of-vocabulary words: ['Kareem', 'Rawda', 'koberskobl']
Is this the best way to do this? I think I have some misunderstanding of the tokenization process!
What I am trying to do is figure out whether a word has a high probability of being OOV.

@vdw I would be happy to know your answer; it would help a lot to improve the current Arabic models. I asked a lot of people, but all of them say this is not possible with subword tokenization, or with tokenization in general!
Can you give me any suggestions?
One of the solutions I am thinking of is to build a separate dictionary in which I save all the words I want to train my model on, and to do a lemmatization step before training the model. This is really tedious, so any solutions or guides would help me a lot. I am sorry for tagging you, but it's an important question for me :sweat_smile:
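
Here is roughly what I mean with the dictionary + lemmatization idea (just a sketch; WordNetLemmatizer is only an example, for Arabic I would need an Arabic lemmatizer):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

# Build a set of lemmas from all the words I want to train the model on.
train_sentences = ["Kareem love Rawda but she can't fly"]
known_lemmas = {
    lemmatizer.lemmatize(word.lower())
    for sentence in train_sentences
    for word in sentence.split()
}

def probably_oov(word):
    # A word is "probably OOV" if its lemma never appeared among the training words.
    return lemmatizer.lemmatize(word.lower()) not in known_lemmas

print(probably_oov("koberskobl"))  # True
print(probably_oov("fly"))         # False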

I don’t think you have a misunderstanding. Tokenization simply means to split a string into tokens. These can be characters, but more commonly words and/or subwords. The problem with basic word tokenization is that (a) the number of unique tokens can be very large and (b) their frequencies typically follow a Zipf distribution – namely, there is a long tail of many tokens (i.e., words) that occur very rarely or maybe just once. That’s never good for statistical models.
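
To make that long tail concrete, here is a minimal sketch (the toy corpus is just a placeholder for a real training corpus) that counts word frequencies and shows how many words occur only once:

from collections import Counter

# Toy corpus; in practice this would be the tokenizer's training data.
corpus = [
    "Kareem love Rawda but she can't fly",
    "she can fly but he cannot",
    "how to do this",
]

# Naive whitespace word tokenization.
counts = Counter(word.lower() for line in corpus for word in line.split())

# The long tail: words that occur only once in the corpus.
singletons = [word for word, freq in counts.items() if freq == 1]
print(f"{len(singletons)} of {len(counts)} unique words occur only once")

With pure word-level tokenization, every one of those rare words needs its own vocabulary entry, which is exactly what subword tokenization tries to avoid.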

You are using subword-based tokenization. This tokenizer is trained on a dataset to split strings into words and, if needed, subwords based on their frequencies. Since “Kareem” was very rare in this training dataset, the tokenizer splits it into more common subwords. For example, “ree” appears in “tree”, “free”, “freedom”, “greed”, “greedy”, “breed”, etc. So the tokenizer knows “ree”, just not “Kareem” as a whole. I assume you consider all words that are not kept as complete tokens to be OOV words, which seems intuitive :).
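
You can see this behaviour directly by tokenizing individual words. A minimal sketch, assuming a WordPiece tokenizer such as bert-base-uncased (the exact splits depend on that tokenizer's training vocabulary, so they may differ from your example):

from transformers import AutoTokenizer

# bert-base-uncased is only an example; any WordPiece-style tokenizer works.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for word in ["tree", "freedom", "Kareem", "koberskobl"]:
    # Frequent words usually survive as a single piece, while rare words
    # are broken into known subwords (something like ['ka', '##ree', '##m']).
    print(word, "->", tokenizer.tokenize(word))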

While character- and word-based tokenizers are relatively easy to implement using a regex and rule-based approach, subword tokenizers are trained over data. Thus it also depends on the training data (and the parameters for the training) whether a certain word will be an OOV word later. So if you use a pretrained tokenizer, it’s difficult to know in advance if a word/token will be OOV, at least as far as I can tell.
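
What you can do for a given pretrained tokenizer is check whether a word survives as a single vocabulary token: if it gets split into several pieces (or mapped to the unknown token), it is not in the vocabulary as a whole word. A rough sketch of that heuristic (again assuming a WordPiece-style tokenizer as an example; note that words containing punctuation, like "can't", will also be split and therefore flagged):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def is_oov(tokenizer, word):
    # Treat a word as "OOV" if it cannot be represented as a single
    # vocabulary token, i.e. it is split or mapped to the unknown token.
    pieces = tokenizer.tokenize(word)
    return len(pieces) != 1 or pieces[0] == tokenizer.unk_token

sentence = "Kareem love Rawda but she can't fly how to koberskobl"
oov_words = [w for w in sentence.split() if is_oov(tokenizer, w)]
print("Words not in the vocabulary as single tokens:", oov_words)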

EDIT: I’m not sure what you mean by Arabic models, since your example sentence is in English. In the case of a machine translation task, you would of course need 2 tokenizers, one for English and one for Arabic.