Out-of-vocabulary KeyError on vocab.stoi in TorchText

I created an NLP model on a dataset with torchtext. Now I am trying to generate predictions for new cases, but vocab.stoi throws a KeyError.

import dill
from torchtext.data import Field, TabularDataset, Iterator  # torchtext.legacy.data in newer versions

# Original TEXT field
# TEXT = Field(
#     sequential=True,
#     tokenize=tokenizer,
#     lower=True,
#     use_vocab=True,
#     # fix_length=200,
#     pad_first=True
# )
TEXT = dill.load(open("field.dill", "rb"))

unl_datafields = [('Column1', None), ('Text', TEXT), ('Column2', None)]

unl = TabularDataset(
    path='./unl.csv', 
    format='csv',
    csv_reader_params={ 'delimiter': ';' },
    skip_header=True,
    fields=unl_datafields
)

unl_iter = Iterator(
    unl,
    batch_size=64,
    shuffle=False,
    sort=False,
    sort_within_batch=False
)

next(iter(unl_iter))

The code above throws the following error:

    334         if self.use_vocab:
    335             if self.sequential:
--> 336                 arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
    337             else:
    338                 arr = [self.vocab.stoi[x] for x in arr]

KeyError: ' '

It seems like ' ' (a single space) cannot be found in vocab.stoi, which is surprising since the tokenizer splits on spaces. Moreover, I would like this to work for other words that are out of vocabulary as well. Can anyone give me some pointers on how to proceed?
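For reference, this is how I would check whether the loaded vocab falls back to <unk> for unknown tokens (assuming the legacy torchtext Field/Vocab API, where stoi should be a defaultdict):

print(type(TEXT.vocab.stoi))                             # a defaultdict if '<unk>' is set
print(TEXT.vocab.stoi['definitely-not-in-the-vocab'])    # should give the '<unk>' index, usually 0
print(TEXT.vocab.stoi[TEXT.unk_token])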

I'm not sure if I understand you correctly, but whitespace shouldn't be part of your vocabulary. Whitespace only separates words/tokens; it is not a word/token itself. So the error is to be expected.

The question is rather why you are trying to look up whitespaces in the first place. That shouldn’t happen. Usually you have something like:

sentence = 'This is a normal sentence, nothing special.'
tokens = tokenize(sentence, separator=' ')
# tokens = ['This', 'is', 'a', 'normal', 'sentence', ',', 'nothing', 'special', '.']

Is there any chance that you have sentences with more than one whitespace in a row and your tokenizer gets thrown off by that? For example:

sentence = 'This sentence has a gap     with five whitespaces.'

Maybe in that case tokens contains whitespaces as elements. Anyway, there should be no situation where you have to look up whitespaces.
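For illustration, here are two guesses (sketches, not your actual tokenizer) at how such a lookup could happen:

import re

sentence = 'This sentence has a gap     with five whitespaces.'

# Splitting on a single space turns the extra spaces into empty-string tokens.
sentence.split(' ')
# ['This', 'sentence', 'has', 'a', 'gap', '', '', '', '', 'with', 'five', 'whitespaces.']

# A split that keeps its separators (capturing group) yields literal ' ' tokens,
# which would then be looked up in vocab.stoi and raise exactly that KeyError.
re.split(r'(\s)', sentence)[:6]
# ['This', ' ', 'sentence', ' ', 'has', ' ']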

I think even for the consecutive-whitespace case, the tokenizer should handle it properly (i.e. collapse multiple whitespaces into one).
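Something like this (just a minimal sketch, not the tokenizer from the question) would do that:

import re

def tokenize(text):
    # Collapse any run of whitespace into a single space, then split.
    return re.sub(r'\s+', ' ', text).strip().split(' ')

tokenize('This sentence has a gap     with five whitespaces.')
# ['This', 'sentence', 'has', 'a', 'gap', 'with', 'five', 'whitespaces.']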

I agree, a proper tokenizer will most likely handle that. I was just making a crude guess as to why there might be a lookup for a whitespace in the first place.

I certainly had similar issues, since I had to implement my own tokenizer to better handle social media text, where people forget whitespaces, use emojis or emoticons as sentence separators, etc. Granted, the more common problem I had was empty tokens instead of tokens with one or more whitespaces, like in:

'1 2     3'.split(' ')
# ['1', '2', '', '', '', '', '3']
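For what it's worth, split() without a separator argument already drops those empty strings:

'1 2     3'.split()
# ['1', '2', '3']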

If you look at the basic English normalization in torchtext (the same as in fastText), multiple whitespaces are replaced with a single space before tokenization.
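For example (assuming a torchtext version that ships get_tokenizer):

from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer('basic_english')
tokenizer('This sentence has a gap     with five whitespaces.')
# ['this', 'sentence', 'has', 'a', 'gap', 'with', 'five', 'whitespaces', '.']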

I guess you could remove the emoji as well.
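Something like this crude sketch could do it (the Unicode ranges are only illustrative, not exhaustive):

import re

# Rough emoji stripper; extend the ranges to whatever your data needs.
EMOJI_PATTERN = re.compile(
    '['
    '\U0001F300-\U0001FAFF'  # pictographs, emoticons, transport symbols, etc.
    '\u2600-\u27BF'          # misc symbols and dingbats
    ']+'
)

def strip_emoji(text):
    return EMOJI_PATTERN.sub('', text)

strip_emoji('great day 😀🎉 at the beach 🏖')
# 'great day  at the beach '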