Torchtext and <unk>, <pad>

I have a question about NLP tasks such as tagging and parsing. When I define a Field for the input sentences, I define a sequential Field, and torchtext conveniently handles <unk> and <pad> for me, plus <eos> and <sos> if necessary.

The question is about defining a Field for the outputs or targets, as in tagging: I do not want <unk> to be in that Field's vocabulary, because I don't want the classifier to ever output <unk>. Although we can build a Field without <pad>, <eos>, or <sos>, <unk> is always included. Is there a way to define a Field without <unk> in its vocab?

I have been doing it in two ways:

  1. Just leave it there, trusting that the model will not prefer an <unk> output because the training data never contains <unk> as a target.

  2. Build a softmax layer whose output_size is one smaller than the output field's vocab size, and add some index-shifting code like pred = softmax_output.argmax(dim=-1) + 1 or gold_output = gold_output - 1.
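Concretely, option 2 might look like the following minimal sketch, assuming C real tags with <unk> fixed at index 0 of the tag vocabulary (the sizes and random data here are just for illustration):

```python
import torch
import torch.nn.functional as F

# Suppose the tag vocabulary has C real tags plus <unk> at index 0,
# so the gold labels produced by torchtext lie in [1, C].
C = 5
batch_size = 3

logits = torch.randn(batch_size, C)            # classifier head of size C, not C + 1
gold = torch.randint(1, C + 1, (batch_size,))  # torchtext indices; <unk> = 0 never appears

loss = F.cross_entropy(logits, gold - 1)       # shift gold labels into [0, C - 1]
pred = logits.argmax(dim=-1) + 1               # shift predictions back to vocab indices
```

The shifting works, but it scatters off-by-one bookkeeping through the training and decoding code, which is exactly what makes it feel unclean.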

Neither feels like a neat way to do it. So, does anyone have ideas about the best practice here?

Thank you!

For example, you can define a Field TEXT for your input sequence and a Field TAG for your output.
When you build your vocabulary, you can do the following:
TEXT.build_vocab(train_dataset)
TAG.build_vocab(train_dataset)
This way, each Field builds its own vocabulary from its own column, so the words in your output won't be mixed into your input vocabulary. That's the mechanism in torchtext. Powerful, right?
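The per-field vocabulary idea can be illustrated without torchtext at all; this is a toy sketch (the build_vocab helper and the tiny dataset are hypothetical stand-ins, not torchtext API): each field's vocabulary is built only from its own column, so the input side gets the special tokens while the tag side stays clean.

```python
from collections import Counter

# Toy training data: each example is a (tokens, tags) pair.
train = [
    (["the", "dog", "barks"], ["DT", "NN", "VBZ"]),
    (["a", "cat", "sleeps"], ["DT", "NN", "VBZ"]),
]

def build_vocab(sequences, specials):
    """Map each token to an index, with the special tokens placed first."""
    counter = Counter(tok for seq in sequences for tok in seq)
    itos = list(specials) + sorted(counter)
    return {tok: i for i, tok in enumerate(itos)}

# The input field gets <unk>/<pad>; the tag field gets no specials at all.
text_vocab = build_vocab((toks for toks, _ in train), specials=["<unk>", "<pad>"])
tag_vocab = build_vocab((tags for _, tags in train), specials=[])
```

Here tag_vocab contains only the tags seen in training, with no <unk> entry, which is what the question is after.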

Yes, but when you want a C-class classifier, the <unk> pushes you toward building a (C+1)-class classifier instead. How do you handle that?

When you define your Field for the target, you shouldn't treat it as a sequential Field; treat it as a discrete label, even though it comes in a sequential form. In a sense, you shouldn't build a vocabulary for your target at all, because you are solving a classification problem, not a sequence generation problem.

But in tasks like POS tagging, the output field is sequential. I would provide

sequences: LongTensor(sequence_length, batch_size)
tags: LongTensor(sequence_length, batch_size)

to my model. If the target field is not sequential, processing a whole sequence in one forward pass with batched inference might become a bit inconvenient. Or do you conventionally implement a one-step-per-forward model?
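For what it's worth, a whole-sequence-per-forward tagger is straightforward with tensors shaped this way; here is a minimal sketch (the Tagger module and all sizes are made up for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical sizes, just for the sketch.
vocab_size, tagset_size, emb_dim, hid_dim = 50, 10, 8, 16
seq_len, batch_size = 7, 4

class Tagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim)   # expects (seq_len, batch, emb_dim)
        self.out = nn.Linear(hid_dim, tagset_size)

    def forward(self, sequences):               # sequences: (seq_len, batch)
        hidden, _ = self.lstm(self.embed(sequences))
        return self.out(hidden)                 # (seq_len, batch, tagset_size)

sequences = torch.randint(0, vocab_size, (seq_len, batch_size))
logits = Tagger()(sequences)                    # one forward pass tags every position
```

So a sequential target field fits this setup naturally; the question is only whether its vocabulary has to carry <unk>.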

I am Chinese and my English is poor, so I can't find a good way to express my meaning. Sorry about that! :sweat:

Thank you! Maybe some pseudocode would help.

defaultdict(<function torchtext.vocab._default_unk_index>,
            {'#': 40,
             '$': 21,
             "''": 26,
             '(': 35,
             ')': 34,
             ',': 8,
             '.': 11,
             ':': 27,
             '<pad>': 1,
             '<unk>': 0, # that's it! the <unk>
             'CC': 13,
             'CD': 9,
             'DT': 5,
             'EX': 37,
             'FW': 43,
             'IN': 4,
             'JJ': 7,
             'JJR': 29,
             'JJS': 32,
             'LS': 41,
             'MD': 23,
             'NN': 2,
             'NNP': 3,
             'NNPS': 44,
             'NNS': 6,
             'PDT': 45,
             'POS': 20,
             'PRP': 19,
             'PRP$': 24,
             'RB': 12,
             'RBR': 31,
             'RBS': 38,
             'RP': 30,
             'TO': 15,
             'UH': 42,
             'VB': 14,
             'VBD': 10,
             'VBG': 18,
             'VBN': 16,
             'VBP': 22,
             'VBZ': 17,
             'WDT': 28,
             'WP': 33,
             'WP$': 39,
             'WRB': 36,
             '``': 25})

Maybe a non-sequential Field would help. I will try to figure that out.

I think you are looking for:
LABEL = data.Field(sequential=False, unk_token=None)

Note: I did have to update PyTorch and torchtext for this to work
