How to handle out-of-vocabulary tokens at inference using a torchtext Field?

aryopg · May 21, 2020, 3:33pm

Hi guys, I am facing a problem with the torchtext package. In the data-building phase, I created a text field using data.Field and built the vocabulary from the training data:

shared_text_field = data.Field(sequential=True, tokenize=self.tokenizer.tokenize,
                               init_token=self.sos_token, eos_token=self.eos_token,
                               pad_token=self.pad_token, unk_token=self.unk_token)
shared_text_field.build_vocab(train)

Unfortunately, when I tried to use it on my test data, I got a KeyError (sorry for the truncated error message):

File "/Users/aryopg/.local/share/virtualenvs/learning-Y_vf_ZaD/lib/python3.7/site-packages/torchtext/data/field.py", line 336, in <listcomp>
arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
KeyError: 'hardship'

Did I do something wrong, or is this not supported at the moment? (It would be a bit bizarre if it is not supported yet.) I'll be happy to provide more details. Thanks!

dunefox · May 22, 2020, 8:58am

There is a similar issue here, but with a whitespace token being looked up: Out-of-vocabulary KeyError on vocab.stoi in TorchText
Even if the comment there is correct (that whitespace shouldn't really be part of your vocabulary), shouldn't it be mapped to the unknown token by default? Unfortunately, they don't really answer the question about OOV words.

Issue without much discussion: https://github.com/pytorch/text/issues/337
… Maybe you can revive the issue or create a new one.
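
To make the expected behaviour concrete, here is a minimal sketch using only the Python standard library. It is not torchtext's code: the toy vocabulary, the `<unk>` string, and the names `stoi_strict` and `stoi_lenient` are all assumptions for illustration. It shows that a mapping without a default raises KeyError on an unseen token, while a mapping whose default points at the unknown token's index does not.

```python
from collections import defaultdict

# Hypothetical toy vocabulary; index 0 is reserved for the unknown token.
itos = ["<unk>", "<pad>", "the", "cat"]
base = {tok: i for i, tok in enumerate(itos)}

# A mapping with no default: looking up an unseen token raises KeyError.
stoi_strict = dict(base)

# A mapping whose default value is the unknown token's index.
# Note: defaultdict stores the default under the missing key on first lookup.
unk_index = base["<unk>"]
stoi_lenient = defaultdict(lambda: unk_index, base)

tokens = ["the", "hardship", "cat"]

strict_error = None
try:
    strict_ids = [stoi_strict[t] for t in tokens]
except KeyError as err:
    # The offending token is carried in the exception arguments.
    strict_error = err.args[0]

lenient_ids = [stoi_lenient[t] for t in tokens]
```

If the lookup in the traceback behaves like `stoi_strict`, it may be worth checking that the unknown token you passed is actually present in the built vocabulary and has an index.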

aryopg · May 22, 2020, 5:22pm

Thanks for the response and for pointing out those links! I am quite aware of the unresponsiveness, unfortunately.
This may not be the place to ask, but is there another library that is more robust and better maintained than torchtext? P.S. Hopefully this will ring a little bell for the developers lol

abishekchiff · May 26, 2020, 8:10am

There are bugs in torchtext. Try again after installing it straight from the GitHub repository: pip install --upgrade git+https://github.com/pytorch/text

cramraj8 · February 12, 2021, 1:58am

Are there any answers to the OOV issue so far?
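
One workaround that does not depend on the library's internals is to replace unseen tokens with the unknown token yourself before the index lookup. The sketch below is only an illustration under assumptions: `replace_oov` is a hypothetical helper, and the toy `stoi` dict and the `<unk>` string stand in for whatever vocabulary and unknown token you actually use.

```python
from typing import Iterable, List, Set


def replace_oov(tokens: Iterable[str], known: Set[str],
                unk_token: str = "<unk>") -> List[str]:
    """Return the tokens with every string not in `known` swapped for `unk_token`."""
    return [t if t in known else unk_token for t in tokens]


# Hypothetical vocabulary built from training data; it must contain unk_token.
stoi = {"<unk>": 0, "<pad>": 1, "life": 2, "is": 3}

test_tokens = ["life", "is", "hardship"]

# Swap unseen tokens first, so the plain dict lookup below cannot fail.
safe_tokens = replace_oov(test_tokens, set(stoi))
ids = [stoi[t] for t in safe_tokens]
```

This only hides the symptom: if the unknown token itself is missing from the vocabulary, the lookup still fails, so that is worth checking first.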