Hugging Face's RoBERTa config and tokenizer do not have matching vocabulary (?)

biobroo · October 22, 2021, 2:26pm

It appears to me that the Hugging Face transformers library has a mismatched tokenizer and config for RoBERTa with respect to vocabulary size. The RoBERTa config object lists a vocabulary size of 30522, while the tokenizer has a much larger vocabulary. The RoBERTa paper lists 50k as the correct value, which leads me to believe that the problem I am having lies in the config class. Additionally, the documentation for the Hugging Face RoBERTa config says that it is the same as the BERT config, and BERT has a smaller vocabulary (~30k). I have some specific questions:

(1) How can I set up RoBERTa so that it will recognize all of the IDs resulting from its tokenizer?
(2) Less importantly, why is this implementation of RoBERTa set up this way?

I am including a toy example to illustrate my problem and facilitate simple answers:

# setting up RoBERTa
from transformers import RobertaConfig, RobertaModel, RobertaTokenizer
configuration = RobertaConfig()
roberta = RobertaModel(configuration)
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# using RoBERTa with a problematic token
text = 'currency'
tokenized = tokenizer.encode(text,
                             add_special_tokens=False,  # drops the beginning-of-sentence and end-of-sentence tokens for this toy example; this is not what causes the error
                             return_tensors='pt')
roberta(tokenized)

# The error (an IndexError from the embedding lookup in recent PyTorch versions, rather than a KeyError) is raised because the token index falls outside the vocabulary size of the config
print(tokenizer.convert_tokens_to_ids(text))   # token index
print(configuration.vocab_size)    # vocabulary size of config

# Just for comparison, here is an unproblematic token
unproblematic_text = tokenizer.convert_ids_to_tokens(10004)
print(unproblematic_text)
encoded = tokenizer.encode(unproblematic_text,
                           add_special_tokens=False,
                           return_tensors='pt')
roberta_output = roberta(encoded)
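
To make the comparison above explicit, here is a minimal sketch of a pre-flight check that flags token IDs an embedding table of a given size cannot look up. It uses plain Python lists so it runs without downloading anything. The helper name `out_of_range_ids` and the ID 40000 are hypothetical and chosen only for illustration; 30522 is the value the default config reports above, and 50265 is, as far as I know, the `vocab_size` of the `roberta-base` checkpoint.

```python
from typing import List, Sequence


def out_of_range_ids(token_ids: Sequence[int], vocab_size: int) -> List[int]:
    """Return the token IDs that an embedding table with `vocab_size` rows cannot look up."""
    return [i for i in token_ids if not 0 <= i < vocab_size]


DEFAULT_CONFIG_VOCAB = 30522  # vocab_size reported by the default config in this post
ROBERTA_BASE_VOCAB = 50265    # vocab_size of the roberta-base checkpoint (assumption stated above)

# 10004 is the unproblematic ID from the post; 40000 is a made-up ID above 30522.
ids = [10004, 40000]

bad_default = out_of_range_ids(ids, DEFAULT_CONFIG_VOCAB)
bad_roberta = out_of_range_ids(ids, ROBERTA_BASE_VOCAB)

print(bad_default)  # IDs the smaller embedding table would reject
print(bad_roberta)  # IDs the larger embedding table would reject
```

With the real objects, the same check could presumably be run as `out_of_range_ids(tokenized[0].tolist(), configuration.vocab_size)`. If the goal is simply a config that matches the tokenizer, one option that should work is to build the config from the same checkpoint, e.g. `RobertaConfig.from_pretrained("roberta-base")`, or to load the model directly with `RobertaModel.from_pretrained("roberta-base")`, rather than relying on the defaults of `RobertaConfig()`.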