Tokenization of Multiple Languages

Hi, I’m about to train a new Hugging Face tokenizer. The thing is that my data contains three different languages. Do I need to tokenize the three language corpora separately, or how should I handle that?

And one more question: if I don’t train my tokenizer on all three languages, but only on the dominant language of the data, how would I handle the other languages at test time? And how do I handle it when a totally different language comes in, other than these three?

If the languages other than the dominant one occur frequently, you may want to think about using language identification. I think it depends on the accuracy of the overall system: if it is accurate enough, you may just use the tokenizer of the dominant language. However, if you need more accuracy, you should think about integrating the other languages’ tokenizers into the system.

In traditional machine learning models, you can produce unigram features for each language if you apply a language-specific tokenizer. For deep learning models, there may not be specific embeddings for words in other languages, so they may be treated as unknown (out-of-vocabulary) tokens even if you tokenize and process them. If the root forms are similar across languages, I think models like BERT may perceive a foreign word as similar to one the model has seen.
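The subword fallback described above can be sketched with a toy, WordPiece-style greedy longest-match tokenizer. The tiny vocabulary below is invented for illustration (it is not BERT’s real vocabulary); the point is that an unseen word can still split into known pieces instead of collapsing to one unknown token:

```python
# Toy WordPiece-style tokenizer: greedy longest-match from the left.
# Pieces starting mid-word carry a "##" continuation marker, as in BERT.
VOCAB = {"[UNK]", "play", "##ing", "##ed", "spiel", "##en"}

def wordpiece(word, vocab=VOCAB):
    """Split `word` into the longest known pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation marker
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:           # no piece matches: whole word becomes UNK
            return ["[UNK]"]
        pieces.append(cur)
        start = end
    return pieces

print(wordpiece("playing"))   # ['play', '##ing']
print(wordpiece("spielen"))   # ['spiel', '##en'] — a "foreign" word still splits
print(wordpiece("зима"))      # ['[UNK]'] — alphabet not covered at all
```

So a foreign word with familiar root forms gets partial embeddings via shared pieces, while a word from an uncovered alphabet really does fall back to the unknown token.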


I’m actually going to train on the data with a Random Forest, so I’m going to google unigram tokenization. Maybe this.

But I wanted to do tokenization “from scratch”, meaning I want to tokenize each letter individually: {a: 0, b: 1, c: 2, …}. So I’m looking for articles on this kind of tokenization; I don’t know what it is called, to google it. And one more thing: if I go this way, how would I tokenize different languages? Do I just need the alphabets of these three languages? And how would I tokenize emojis and punctuation marks?
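What you’re describing is usually called character-level tokenization, which should help with searching. One way to handle multiple alphabets, punctuation, and emojis is to build the character vocabulary from the training data itself and reserve one id for anything unseen. A minimal sketch in plain Python (the toy corpus is made up):

```python
# Character-level tokenization: build the vocabulary from the training data
# so all alphabets, punctuation, and emojis present in it are covered.
# Id 0 is reserved for characters seen only at test time.
def build_char_vocab(corpus):
    vocab = {"<unk>": 0}
    for text in corpus:
        for ch in text:
            if ch not in vocab:
                vocab[ch] = len(vocab)
    return vocab

def encode(text, vocab):
    return [vocab.get(ch, vocab["<unk>"]) for ch in text]

corpus = ["abc, abd!", "добрый день", "günaydın 🙂"]
vocab = build_char_vocab(corpus)
print(encode("abd", vocab))    # [1, 2, 6]
print(encode("x🙂?", vocab))    # [0, 22, 0] — 'x' and '?' unseen, map to <unk>
```

So you don’t hard-code three alphabets: whatever characters occur in the data get ids, and truly new characters (a fourth language, a new emoji) all share the unknown id.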

If you want to try the traditional bag-of-words approach, you may use scikit-learn and its TfidfVectorizer. I haven’t used it yet, but you can probably implement a bag-of-words approach with unigram features in a few lines of code. If the languages you mentioned use the Latin alphabet, you can take advantage of spaces and other whitespace characters for tokenization: first replace the other whitespace characters in the text with the space character, then split on spaces. However, if the languages do not use the Latin alphabet, or do not share the same alphabet, you will need to search for an appropriate tokenization approach.
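Building on the TfidfVectorizer suggestion, a minimal sketch of unigram features feeding a Random Forest (the toy texts, labels, and forest settings are all made up):

```python
# Bag-of-words with scikit-learn: ngram_range=(1, 1) gives plain unigram
# features, which can feed a Random Forest directly.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["the cat sat", "the dog ran", "a cat and a dog"]
labels = [0, 1, 1]

vectorizer = TfidfVectorizer(ngram_range=(1, 1), lowercase=True)
X = vectorizer.fit_transform(texts)           # sparse document-term matrix
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, labels)

# Unseen text goes through transform(); words not in the fitted
# vocabulary are silently ignored.
print(clf.predict(vectorizer.transform(["the cat"])))
```

Note that the default token pattern drops single-character tokens like “a”, so the fitted vocabulary here has six words, not seven.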

Pytorch has a tutorial on this, including CBOW and N-grams:

Sentencepiece tokenizers can be found here:

And N-gram tokenizers here:

If you make use of torchtext’s generate_sp_model and set model_type='unigram', you can run it on your corpus and generate files that contain the unigrams (bpe, char, and word are the other model types). It’s very quick and flexible, in that you can specify the maximum vocabulary size (this just means the number of optimized combinations of letters or words).

Then you can make use of load_sp_model on the files you generated.

To turn text into numbers, you can use sentencepiece_numericalizer on the words you wish to feed to the model.

And when you get numbers out of the model, you can turn them back into text with the loaded SentencePiece model’s DecodeIds (note that sentencepiece_tokenizer goes the other way, splitting raw text into subword strings).

Those can all be found under torchtext.data.functional — Torchtext 0.13.0 documentation.


What I’m looking for now is this: when a new word comes in at test time, a word unknown to the tokenizer, I want to still tokenize it and then look for a similar word in the tokenizer’s embedding space.

That is why I want to tokenize my words “by hand”, in plain Python, with a dict like {a: 0, b: 1, …}.
When a new word comes in, for example “add” (suppose this word is unknown to the tokenizer), it will still be tokenized, as [0, 3, 3], and the model can then look for a similar word in the embedding space.
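That idea can be sketched in plain Python. Note that under {a: 0, b: 1, c: 2, …}, “add” maps to [0, 3, 3] (d is 3). The bag-of-characters vectors and the tiny word list below stand in for a learned embedding space, which a real system would use instead:

```python
# Sketch: encode a word by its characters, then find the nearest known
# word via cosine similarity over bag-of-character count vectors.
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
CHAR_ID = {c: i for i, c in enumerate(ALPHABET)}   # {'a': 0, 'b': 1, ...}

def char_vector(word):
    counts = Counter(CHAR_ID[c] for c in word if c in CHAR_ID)
    return [counts.get(i, 0) for i in range(len(ALPHABET))]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

known = ["adds", "subtract", "multiply"]   # words the "model" has seen
query = "add"                              # unseen word
print([CHAR_ID[c] for c in query])         # [0, 3, 3]
best = max(known, key=lambda w: cosine(char_vector(w), char_vector(query)))
print(best)                                # 'adds'
```

Counts of characters ignore their order, so this is only a crude stand-in; the structure (encode unknown word, then nearest-neighbour search) is the part that carries over.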

I don’t know if I’m being clear or not.


I’m checking that.

  1. If you’re using an n-gram or word model, you can pass the new word in as a designated “unknown” token. In fact, during training, you can mask random words/n-grams with this unknown token. Google found this increases model robustness, much like augmentations for vision models (though don’t make me find the paper). This deals with new unknown words or typos without needing to retokenize or retrain. The model has to infer the meaning from the other words.
  2. If you set model_type='char', you can feed the model base characters. In that case, you can still use masking, as mentioned above, along with random flipping of adjacent letters (i.e. simulating typos during training).
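Both tricks from the list above can be sketched in plain Python; the masking probabilities and seeds here are arbitrary choices, not recommended values:

```python
# 1. Mask random tokens with <unk> so the model learns to infer meaning
#    from context; 2. flip adjacent characters to simulate typos.
import random

def mask_tokens(tokens, unk="<unk>", p=0.15, rng=random):
    return [unk if rng.random() < p else t for t in tokens]

def flip_adjacent(word, p=0.1, rng=random):
    chars = list(word)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < p:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2          # don't flip the same letter twice
        else:
            i += 1
    return "".join(chars)

print(mask_tokens("the quick brown fox jumps".split(),
                  p=0.5, rng=random.Random(0)))
# ['the', 'quick', '<unk>', '<unk>', 'jumps']
print(flip_adjacent("typing", p=0.3, rng=random.Random(1)))
# 'ytpign'
```

Both augmentations happen on the fly inside the training loop, so the tokenizer and the stored dataset never change.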

Thank you, I’ll try that. I may just train a Hugging Face tokenizer on a new language.