Hi, my question is: can a tokenizer and then, for example, an NLP summarization model be trained on the same data?
One more question: how can I train on a language that has a non-Latin alphabet? When I train, the tokenizer cannot read the sentence. It returns something like this:
['\n', 'áĥ¥', 'áĥIJáĥłáĥĹáĥ£áĥļáĥĺ', 'Ġáĥĺáĥ¡', 'áĥ¬áĥIJáĥķáĥļ', 'áĥĶ', '?', '']
Should I map each character to a number manually, like a:0, b:1, etc.? But a and b in my language are ა, ბ.
I'm talking about the Hugging Face tokenizer.
Yes, you can train the tokenizer and the model on the same data.
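A minimal sketch of that workflow, assuming a GPT-2 style byte-level BPE tokenizer from the `tokenizers` library and a hypothetical plain-text file `corpus.txt` (both names are just for illustration):

```python
# Train a tokenizer on the same corpus the model will later be trained on.
# corpus.txt is a hypothetical plain-text file, one document per line.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],   # the same files you later feed to model training
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["<s>", "</s>", "<pad>", "<unk>", "<mask>"],
)
tokenizer.save_model(".")   # writes vocab.json and merges.txt
```

The saved `vocab.json`/`merges.txt` can then be loaded through a `transformers` tokenizer class (e.g. `RobertaTokenizerFast`) and used while training or fine-tuning a summarization model on that same corpus.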
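As for the second question: with a byte-level BPE tokenizer (which is what that output looks like), you should not need to map characters to numbers manually. The tokenizer works on UTF-8 bytes and displays each byte as a printable character, so Georgian characters show up as sequences like áĥ¥, and Ġ marks a leading space; the underlying text is intact, and decoding the ids returns the original string. A quick sketch (the file name and sample sentence are only illustrations):

```python
# Show that the odd-looking tokens are a byte-level display, not corruption.
# georgian.txt is a hypothetical file of Georgian training text.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["georgian.txt"], vocab_size=30_000)

enc = tokenizer.encode("ქართული ტექსტი")
print(enc.tokens)                 # e.g. ['áĥ¥', 'áĥIJáĥłáĥĹáĥ£áĥļáĥĺ', ...]
print(tokenizer.decode(enc.ids))  # "ქართული ტექსტი", the original sentence
```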