Hi, how can I train a tokenizer like the XLM-RoBERTa tokenizer from scratch, so that I end up with a sentencepiece.bpe.model file?
I tried loading their tokenizer and calling tokenizer.train_new_from_iterator, but it throws PanicException: likelihood is NAN. Input sentence may be too long. So what sentence length does train_new_from_iterator allow?
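In case it matters, I could pre-chunk long lines before feeding them to the iterator, but I don't know what limit to target. A sketch (the 512-character cap here is just a guess, not a documented limit):

```python
def chunk_lines(lines, max_chars=512):
    """Split each line into pieces of at most max_chars characters.

    max_chars is a guessed cap, not a value documented anywhere.
    """
    for line in lines:
        for start in range(0, len(line), max_chars):
            yield line[start:start + max_chars]

# Toy corpus: one short line and one 2000-character line.
corpus = ["short line", "x" * 2000]
chunks = list(chunk_lines(corpus, max_chars=512))
print(len(chunks))  # → 5 (the 2000-char line becomes 4 chunks)
```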
I also see that their repo contains a sentencepiece.bpe.model, but I am not sure about the correct way to produce a file like that.
Also, in the XLM-R paper the authors say they used SentencePiece with the Unigram model, yet the file is named sentencepiece.bpe.model. So did Hugging Face train a new one with the BPE model instead?
Thank you for reading!