I have a question about NLP tasks such as tagging and parsing. When I define a Field for the input sentences, I make it sequential, and torchtext conveniently handles `<unk>` for me, plus `<pad>`, `<eos>`, and `<sos>` if necessary.
The question is about defining a Field for the outputs or targets, e.g. the tags: I do not want `<unk>` to be in that Field's vocabulary, because I don't want the classifier to be able to output `<unk>`. We can build a Field without `<pad>`, `<eos>`, or `<sos>`, but `<unk>` is always included. Is there a way to define a Field without `<unk>` in its vocab?
I have been handling it in two ways:

- Just leave `<unk>` in the vocab, trusting that the model will not prefer an `<unk>` output because the training data never contains `<unk>` as a target.
- Build a softmax layer whose output_size is one smaller than the output field's vocab size, then shift indices with code like

      pred = softmax_output.argmax(dim=-1) + 1

  and, correspondingly,

      gold_output = gold_output - 1
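To make the second workaround concrete, here is a minimal sketch of the index-shifting idea, assuming a hypothetical tag vocabulary where `<unk>` sits at index 0 (the concrete vocab, sizes, and logits below are made up for illustration):

```python
import torch

# Hypothetical tag vocab: ['<unk>', 'NOUN', 'VERB', 'ADJ'].
# <unk> occupies index 0, so the classifier only needs
# vocab_size - 1 = 3 output units.
vocab_size = 4
num_classes = vocab_size - 1

# Fake logits for a batch of 2 tokens over the 3 real tags.
logits = torch.tensor([[2.0, 0.1, -1.0],
                       [0.3, 0.2, 5.0]])

# Shift predictions up by 1 so they index into the full vocab,
# skipping the <unk> slot at index 0.
pred = logits.argmax(dim=-1) + 1

# Conversely, shift gold labels down by 1 before the loss,
# since the classifier's classes start at vocab index 1.
gold_output = torch.tensor([1, 3])  # vocab indices; never 0 (<unk>)
loss = torch.nn.functional.cross_entropy(logits, gold_output - 1)
```

This keeps the vocab untouched and only remaps indices at the loss/prediction boundary, which is exactly why it feels bolted-on rather than neat.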
Neither feels like a neat way to do it. Does anyone have ideas about the best practice here?
Thank you!