Torchtext and <unk>, <pad>

I have a question about NLP tasks such as tagging and parsing. When I define a Field for the input sentences, I define a sequential Field, and torchtext conveniently handles <unk> and <pad> for me, plus <eos> and <sos> if necessary.

The question is about defining a Field for the outputs or targets, as in tagging: I do not want <unk> to be in that Field's vocabulary, because I don't want the classifier to ever output <unk>. Although we can build a Field without <pad>, <eos>, or <sos>, <unk> is always included. Is there a way to define a Field without <unk> in its vocab?

I have been doing it in two ways:

  1. Just leave it there, trusting that the model will not prefer an <unk> output because the training data never contains <unk> as a target.

  2. Build a softmax layer whose output_size is one smaller than the output field's vocab size, and add some index-shifting code like pred = softmax_output.argmax(dim=-1) + 1 or gold_output = gold_output - 1.
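Concretely, option 2 might look like the following minimal sketch, assuming C real tags with <unk> fixed at index 0 of the tag vocabulary (the sizes and random data here are just for illustration):

```python
import torch
import torch.nn.functional as F

# Suppose the tag vocabulary has C real tags plus <unk> at index 0,
# so the gold labels produced by torchtext lie in [1, C].
C = 5
batch_size = 3

logits = torch.randn(batch_size, C)            # classifier head of size C, not C + 1
gold = torch.randint(1, C + 1, (batch_size,))  # torchtext indices; <unk> = 0 never appears

loss = F.cross_entropy(logits, gold - 1)       # shift gold labels into [0, C - 1]
pred = logits.argmax(dim=-1) + 1               # shift predictions back to vocab indices
```

The shifting works, but it scatters off-by-one bookkeeping through the training and decoding code, which is exactly what makes it feel unclean.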

Neither feels like a neat way to do it. So, does anyone have ideas about the best practice here?

Thank you!

For example, you can define a Field TEXT for your input sequence and a Field TAG for your output.
When you build your vocabulary, you can do the following:
TEXT.build_vocab(train_dataset)
TAG.build_vocab(train_dataset)
This way, each Field builds its own vocabulary from its own column, so the words in your output won't be mixed into your input vocabulary. That's the mechanism in torchtext. Powerful, right?
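The per-field vocabulary idea can be illustrated without torchtext at all; this is a toy sketch (the build_vocab helper and the tiny dataset are hypothetical stand-ins, not torchtext API): each field's vocabulary is built only from its own column, so the input side gets the special tokens while the tag side stays clean.

```python
from collections import Counter

# Toy training data: each example is a (tokens, tags) pair.
train = [
    (["the", "dog", "barks"], ["DT", "NN", "VBZ"]),
    (["a", "cat", "sleeps"], ["DT", "NN", "VBZ"]),
]

def build_vocab(sequences, specials):
    """Map each token to an index, with the special tokens placed first."""
    counter = Counter(tok for seq in sequences for tok in seq)
    itos = list(specials) + sorted(counter)
    return {tok: i for i, tok in enumerate(itos)}

# The input field gets <unk>/<pad>; the tag field gets no specials at all.
text_vocab = build_vocab((toks for toks, _ in train), specials=["<unk>", "<pad>"])
tag_vocab = build_vocab((tags for _, tags in train), specials=[])
```

Here tag_vocab contains only the tags seen in training, with no <unk> entry, which is what the question is after.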

Yes, but when you want a C-class classifier, the <unk> pushes you toward building a (C+1)-class classifier instead. How do you handle that?

When you define your Field for the target, you shouldn't treat it as a sequential Field; treat it as a discrete label, even though it comes in a sequential form. In a sense, you shouldn't build a vocabulary for your target at all, because you are solving a classification problem, not a sequence generation problem.

But in tasks like POS tagging, the output field is sequential. I would provide

sequences: LongTensor(sequence_length, batch_size)
tags: LongTensor(sequence_length, batch_size)

to my model. If the target field is not sequential, processing a whole sequence in one forward pass with batched inference might become a bit inconvenient. Or do you conventionally implement a one-step-per-forward model?
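For what it's worth, a whole-sequence-per-forward tagger is straightforward with tensors shaped this way; here is a minimal sketch (the Tagger module and all sizes are made up for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical sizes, just for the sketch.
vocab_size, tagset_size, emb_dim, hid_dim = 50, 10, 8, 16
seq_len, batch_size = 7, 4

class Tagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim)   # expects (seq_len, batch, emb_dim)
        self.out = nn.Linear(hid_dim, tagset_size)

    def forward(self, sequences):               # sequences: (seq_len, batch)
        hidden, _ = self.lstm(self.embed(sequences))
        return self.out(hidden)                 # (seq_len, batch, tagset_size)

sequences = torch.randint(0, vocab_size, (seq_len, batch_size))
logits = Tagger()(sequences)                    # one forward pass tags every position
```

So a sequential target field fits this setup naturally; the question is only whether its vocabulary has to carry <unk>.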

I am Chinese and my English is poor, so I can't find a good way to express my meaning. Sorry about that! :sweat:

Thank you! Maybe some pseudocode would help.

defaultdict(<function torchtext.vocab._default_unk_index>,
            {'#': 40,
             '$': 21,
             "''": 26,
             '(': 35,
             ')': 34,
             ',': 8,
             '.': 11,
             ':': 27,
             '<pad>': 1,
             '<unk>': 0, # that's it! the <unk>
             'CC': 13,
             'CD': 9,
             'DT': 5,
             'EX': 37,
             'FW': 43,
             'IN': 4,
             'JJ': 7,
             'JJR': 29,
             'JJS': 32,
             'LS': 41,
             'MD': 23,
             'NN': 2,
             'NNP': 3,
             'NNPS': 44,
             'NNS': 6,
             'PDT': 45,
             'POS': 20,
             'PRP': 19,
             'PRP$': 24,
             'RB': 12,
             'RBR': 31,
             'RBS': 38,
             'RP': 30,
             'TO': 15,
             'UH': 42,
             'VB': 14,
             'VBD': 10,
             'VBG': 18,
             'VBN': 16,
             'VBP': 22,
             'VBZ': 17,
             'WDT': 28,
             'WP': 33,
             'WP$': 39,
             'WRB': 36,
             '``': 25})

Maybe a non-sequential Field would help. I will try to figure that out.

I think you are looking for:
LABEL = data.Field(sequential=False, unk_token=None)

Note: I did have to update PyTorch and torchtext for this to work
