Hi, how can I train a tokenizer like the XLM-RoBERTa tokenizer from scratch, so that I end up with a sentencepiece.bpe.model file?
I tried loading their tokenizer and calling tokenizer.train_new_from_iterator, but it throws PanicException: likelihood is NAN. Input sentence may be too long. So what sentence length does train_new_from_iterator allow?
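In case it matters, I could pre-chunk long lines before feeding them to the iterator, but I don't know what limit to target. A sketch (the 512-character cap here is just a guess, not a documented limit):

```python
def chunk_lines(lines, max_chars=512):
    """Split each line into pieces of at most max_chars characters.

    max_chars is a guessed cap, not a value documented anywhere.
    """
    for line in lines:
        for start in range(0, len(line), max_chars):
            yield line[start:start + max_chars]

# Toy corpus: one short line and one 2000-character line.
corpus = ["short line", "x" * 2000]
chunks = list(chunk_lines(corpus, max_chars=512))
print(len(chunks))  # → 5 (the 2000-char line becomes 4 chunks)
```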
I also see that their repo contains a sentencepiece.bpe.model, but I am not sure about the correct way to produce a file like that.
Also, in the XLM-R paper the authors say they used SentencePiece with the Unigram model, yet the file is named sentencepiece.bpe.model. So did Hugging Face train a new one with the BPE model instead?
Thank you for reading!