Hi, how can I train a tokenizer like the XLM-RoBERTa tokenizer from scratch with Hugging Face?
I tried loading their tokenizer and calling `tokenizer.train_new_from_iterator`, but it throws:

`PanicException: likelihood is NAN. Input sentence may be too long.`
So what sentence length does `train_new_from_iterator` allow?
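For reference, here is roughly what I am running. `corpus.txt` is a stand-in for my own data, and the 1,000-character chunking is just my attempt to avoid overly long inputs, not a documented limit:

```python
from transformers import AutoTokenizer

# Load the existing XLM-R tokenizer to use as the template.
old_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def corpus_iterator(path, max_chars=1000):
    # Yield the corpus in short chunks; very long lines seem to be
    # what triggers the "Input sentence may be too long" panic.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            for i in range(0, len(line), max_chars):
                yield line[i : i + max_chars]

# vocab_size here is a placeholder, not XLM-R's actual vocabulary size.
new_tokenizer = old_tokenizer.train_new_from_iterator(
    corpus_iterator("corpus.txt"), vocab_size=32000
)
new_tokenizer.save_pretrained("my-xlmr-tokenizer")
```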
I also see that their repo has a `sentencepiece.bpe.model` file, but I am not sure about the correct way to produce a file like that.
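From reading the SentencePiece docs, I would guess something like this sketch produces such a file, but please correct me if that is not how the original was made. The input path and all the settings below are my guesses, not the values used for XLM-R:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",                # placeholder for my training data
    model_prefix="sentencepiece.bpe",  # writes sentencepiece.bpe.model and .vocab
    vocab_size=32000,                  # placeholder vocabulary size
    model_type="unigram",              # the paper says Unigram; "bpe" is also accepted
    character_coverage=0.9995,         # a value often suggested for multilingual data
)
```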
And in XLM-R's paper, the authors say they use SentencePiece with a Unigram model, yet the file in the repo is named `sentencepiece.bpe.model`. So did Hugging Face train a new one with a BPE model instead?
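One way I thought of to check, though I am not certain this is conclusive, is to inspect the model type behind the fast tokenizer; if I understand correctly, its class name should say whether it is Unigram or BPE under the hood:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
# The fast tokenizer wraps a model from the `tokenizers` library;
# printing its class name shows which model type it uses.
print(type(tok.backend_tokenizer.model).__name__)
```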
Thank you for reading!