My research interest is the effect of emojis in text, and I am trying to classify sarcastic tweets. A month ago I used a dataset where I added the emoji tokens using:
tokenizer.add_tokens(list_of_emojis)  # list_of_emojis is my list of emoji strings
When I tested it, the BERT model had successfully added the tokens. But two days ago, when I did the same thing on another dataset, the BERT tokenizer categorized the emojis as '[UNK]' tokens. My question is: has there been a recent change in the BERT model? I tried it with the following tokenizer,
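To show concretely what I am doing, here is a minimal sketch of my setup (the model name `bert-base-uncased` and the three emojis are placeholders for illustration; my real emoji list is much longer):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; I do the same with distilbert-base-uncased.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Example emojis standing in for my actual list.
emojis = ["😂", "😭", "🙄"]
num_added = tokenizer.add_tokens(emojis)
print(num_added)  # how many of the emojis were actually new to the vocab

# Check whether the emoji survives tokenization or becomes [UNK].
tokens = tokenizer.tokenize("this is so funny 😂")
print(tokens)
```

On the first dataset the emoji came back as its own token; on the new dataset I get '[UNK]' instead. I use the same `add_tokens` pattern for DistilBERT.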
The same happens with DistilBERT: it does not recognize the emojis despite my explicitly adding them. I had read somewhere that there is no need to add them to the tokenizer because BERT and DistilBERT already include those emojis in their ~30,000-token vocabularies, so I tried it both ways, with and without adding them. In both cases the emojis are not recognized.
What can I do to solve this issue? Your thoughts on this would be appreciated.