    334         if self.use_vocab:
    335             if self.sequential:
--> 336                 arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
    337             else:
    338                 arr = [self.vocab.stoi[x] for x in arr]
KeyError: ' '
It seems like ' ' (a single space) cannot be found in vocab.stoi, which is surprising since the tokenizer splits on spaces. Moreover, I would like this to work with other words that are out of vocabulary. Can anyone give me some pointers on how to proceed?
I'm not sure I understand you correctly, but whitespace shouldn't be part of your vocabulary. Whitespace only separates words/tokens; it is not a word/token itself. So the error is to be expected.
The question is rather why you are trying to look up whitespaces in the first place. That shouldn’t happen. Usually you have something like:
sentence = 'This is a normal sentence, nothing special.'
tokens = tokenize(sentence, separator=' ')
# tokens = ['This', 'is', 'a', 'normal', 'sentence', ',', 'nothing', 'special', '.']
Is there any chance that you have a sentence with more than one whitespace character in a row, and your tokenizer gets thrown off by that? For example:
sentence = 'This sentence has a gap     with five whitespaces.'
Maybe in this case tokens ends up containing whitespace (or empty strings) as elements. Either way, there should be no situation where you have to look up whitespace.
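To illustrate the guess above, here is a minimal sketch in plain Python. It does not use torchtext's tokenizer, and the example sentence is made up. Splitting on a literal single space yields an empty string for every extra space, while splitting with no argument treats any run of whitespace as one separator. A tokenizer that keeps whitespace runs as tokens would presumably produce ' ' entries instead of empty strings.

```python
# Made-up example sentence with TWO spaces between 'gap' and 'with'.
sentence = "This sentence has a gap  with two spaces."

# Splitting on a literal ' ' keeps an empty string for the extra space.
naive_tokens = sentence.split(" ")

# Splitting with no argument collapses any run of whitespace.
robust_tokens = sentence.split()

print(naive_tokens)
print(robust_tokens)
```

If the empty string (or ' ') is not a key in the vocabulary, a plain dictionary lookup on naive_tokens would raise a KeyError much like the one in the question.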
I agree, proper tokenizers will most likely handle that correctly. I was just making a crude guess as to why there might be a lookup for whitespace in the first place.
I certainly had similar issues, since I had to implement my own tokenizer to better handle social media text, where people forget whitespace, use emojis or emoticons as sentence separators, etc. Granted, the more common problem I had was empty tokens rather than tokens with one or more whitespace characters, like in:
If you look at the basic English normalization in torchtext (link), which is the same as fastText's, multiple spaces are replaced with a single space before tokenizing.
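A rough sketch of that idea, assuming plain Python rather than the actual torchtext code: collapse whitespace runs with a regular expression, then split, and map anything missing from the vocabulary to an unknown-token index. The names normalize_and_tokenize, numericalize, the toy stoi dictionary, and the '<unk>' token are all hypothetical and only for illustration.

```python
import re

# Matches one or more consecutive whitespace characters.
_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_and_tokenize(text):
    """Collapse whitespace runs to a single space, then split on spaces."""
    collapsed = _WHITESPACE_RUN.sub(" ", text).strip()
    return collapsed.split(" ") if collapsed else []


def numericalize(tokens, stoi, unk_token="<unk>"):
    """Map tokens to indices, falling back to the unknown-token index."""
    unk_index = stoi[unk_token]
    return [stoi.get(token, unk_index) for token in tokens]


# Toy vocabulary (hypothetical); 'again' is deliberately out of vocabulary.
stoi = {"<unk>": 0, "hello": 1, "world": 2}

tokens = normalize_and_tokenize("hello   world  again ")
ids = numericalize(tokens, stoi)

print(tokens)
print(ids)
```

Using stoi.get with a default avoids the KeyError for any out-of-vocabulary token, not just whitespace, which also addresses the second half of the original question.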