It appears to me that Hugging Face's transformers library ships a RoBERTa config and tokenizer whose vocabulary sizes do not match. The default RobertaConfig object reports a vocabulary size of 30522, while the tokenizer has a much larger vocabulary. The RoBERTa paper lists roughly 50k as the correct value, which leads me to believe that the problem I am having comes from the config class (a quick check of the two sizes is shown just after the questions below). Additionally, the Hugging Face documentation says the RoBERTa config is the same as BERT's config, and BERT has the smaller (~30k) vocabulary. I have two specific questions:
(1) How can I set up RoBERTa so that it recognizes all of the IDs produced by its tokenizer? (I suspect something like the sketch at the end of this post, but I would like confirmation.)
(2) Less importantly, why is this implementation of RoBERTa set up this way?
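For reference, this is how I checked the two vocabulary sizes on my machine; the exact numbers may of course differ between transformers versions:

# comparing the config's vocabulary size with the tokenizer's
from transformers import RobertaConfig, RobertaTokenizer

config = RobertaConfig()
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

print(config.vocab_size)  # 30522 for me, i.e. the BERT-sized default
print(len(tokenizer))     # roughly 50k, matching the paper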
I am including a toy example to illustrate the problem and make it easy to give a concrete answer:
# setting up RoBERTa
from transformers import RobertaConfig, RobertaModel, RobertaTokenizer

configuration = RobertaConfig()
roberta = RobertaModel(configuration)
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# using RoBERTa with a problematic token
text = 'currency'
tokenized = tokenizer.encode(text,
                             add_special_tokens=False,  # removes the beginning-of-sentence and end-of-sentence tokens, just for this toy example; this is not responsible for the error
                             return_tensors='pt')
roberta(tokenized)  # fails with an out-of-range index error, since the token index is greater than the vocabulary size of the config

print(tokenizer.convert_tokens_to_ids(text))  # token index
print(configuration.vocab_size)               # vocabulary size of the config

# just for comparison, here is an unproblematic token
unproblematic_text = tokenizer.convert_ids_to_tokens(10004)
print(unproblematic_text)
encoded = tokenizer.encode(unproblematic_text, add_special_tokens=False, return_tensors='pt')
roberta_output = roberta(encoded)
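Regarding question (1), here is what I currently suspect the workaround looks like. I am not sure whether either option is the intended approach, so please correct me if this is wrong:

# Option A (my guess): build the config with the tokenizer's vocabulary size.
# The weights are randomly initialized, but the embedding table is now large enough.
from transformers import RobertaConfig, RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
configuration = RobertaConfig(vocab_size=len(tokenizer))
roberta = RobertaModel(configuration)

# Option B (my guess): load the pretrained model, whose saved config should already match the tokenizer.
roberta_pretrained = RobertaModel.from_pretrained("roberta-base")
print(roberta_pretrained.config.vocab_size)  # I would expect this to match len(tokenizer)

# with either model, the token that previously failed should now be embeddable
encoded = tokenizer.encode("currency", add_special_tokens=False, return_tensors="pt")
output = roberta_pretrained(encoded)

Is one of these the recommended way, or is there a cleaner mechanism I am missing?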