Can a tokenizer and a model be trained on the same data?

Hi, my question is: can a tokenizer and then a model (for example, an NLP summarization model) be trained on the same data?

One more question: how can I train on a language that has a non-Latin alphabet? When I train the tokenizer, it cannot read the sentence. It returns something like this:

['', 'áĥ¥', 'áĥIJáĥłáĥĹáĥ£áĥļáĥĺ', 'Ġáĥĺáĥ¡', 'áĥ¬áĥIJáĥķáĥļ', 'áĥĶ', '?', '']


Should I map each character to a number manually, like a:0, b:1, etc.? But a and b in my language are ა, ბ.

I'm talking about the Hugging Face Tokenizer.

Yes, you can train the tokenizer and the model on the same data.
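
As for the non-Latin alphabet: the output in your question looks like what a byte-level BPE tokenizer produces. It stores text as UTF-8 bytes ('Ġ' is the byte-level marker for a leading space), and decoding recovers the original characters, so you should not need a manual character-to-number mapping. Here is a minimal sketch, assuming the `tokenizers` library's `ByteLevelBPETokenizer`; the corpus file name, vocab size, and Georgian example sentence are placeholders, not your actual setup:

```python
# Minimal sketch: train a byte-level BPE tokenizer on non-Latin (e.g. Georgian) text.
# The odd-looking tokens are just the byte-level representation of UTF-8 bytes;
# decoding restores the original characters.
from tokenizers import ByteLevelBPETokenizer

# Hypothetical training corpus, one sentence per line.
files = ["georgian_corpus.txt"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=files,
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"],
)

# Hypothetical Georgian sentence just for illustration.
encoding = tokenizer.encode("ეს არის მაგალითი?")
print(encoding.tokens)                 # byte-level tokens, may look garbled
print(tokenizer.decode(encoding.ids))  # decodes back to the original text
```

If you would rather see human-readable subwords in the vocabulary, a WordPiece or Unigram model with a Unicode-aware pre-tokenizer (e.g. `Whitespace` or `Metaspace`) is an alternative; byte-level BPE simply guarantees full coverage of any script with no unknown tokens.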
