My research interest is the effect of emojis in text, and I am trying to classify sarcastic tweets. A month ago I used a dataset where I added the emoji tokens using:
tokenizer.add_tokens(list_of_emojis)  # list_of_emojis is my list of emoji strings
When I tested it, the BERT model had successfully added the tokens. But two days ago, when I did the same thing on another dataset, the BERT tokenizer categorized the emojis as '[UNK]' tokens. My question is: has there been a recent change in the BERT model? I tried it with the following tokenizer,
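To show concretely what I am doing, here is a minimal sketch of my setup (the model name `bert-base-uncased` and the three emojis are placeholders for illustration; my real emoji list is much longer):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; I do the same with distilbert-base-uncased.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Example emojis standing in for my actual list.
emojis = ["😂", "😭", "🙄"]
num_added = tokenizer.add_tokens(emojis)
print(num_added)  # how many of the emojis were actually new to the vocab

# Check whether the emoji survives tokenization or becomes [UNK].
tokens = tokenizer.tokenize("this is so funny 😂")
print(tokens)
```

On the first dataset the emoji came back as its own token; on the new dataset I get '[UNK]' instead. I use the same `add_tokens` pattern for DistilBERT.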
The same happens with DistilBERT: it does not recognize the emojis despite my explicitly adding them. I had read somewhere that there is no need to add them to the tokenizer because BERT and DistilBERT already include those emojis in their ~30,000-token vocabularies, so I tried it both ways, with and without adding them. In both cases the emojis are not recognized.
What can I do to solve this issue? Your thoughts on this would be appreciated.