How to use pytorch.text with a character-level sequence tagger

Hi everyone

I am trying to use pytorch.text for my sequence tagger model. Basically, the model predicts the Part-Of-Speech tag for each token in a sentence. It consists of one LSTM and one bidirectional LSTM: first, a character-based LSTM builds the word embedding for each token by processing its characters, and then the second LSTM takes these word embeddings as input and processes them (in both directions) to predict the POS tag for each word.

The problem is that I need to batch both the characters and the words. Is there a way to do that with pytorch.text?

Let me give an example:

Sentence: This is an example sentence .

The first LSTM creates a word embedding for each word in the sentence: W1, W2, W3, W4, W5, W6.
For example, it produces W2 by processing the characters of the second word, “is”, which are “i” and “s”. Then the forward LSTM (of the bi-LSTM) processes W1, W2 and the backward LSTM (of the bi-LSTM) processes W6, W5, W4, W3, W2. I then concatenate the hidden states and make a prediction.
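Just to make the setup concrete, here is a rough, untested sketch of the kind of model I mean; all the names and sizes below are placeholders, and it ignores proper handling of padded characters (real code would use the word lengths / pack_padded_sequence):

```python
import torch
import torch.nn as nn

class CharWordTagger(nn.Module):
    # Sketch only: a char-level LSTM builds one embedding per word,
    # then a word-level bi-LSTM tags each word.
    def __init__(self, n_chars, n_tags, char_dim=30, word_dim=50, hidden_dim=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)  # assumes pad index 0
        self.char_lstm = nn.LSTM(char_dim, word_dim, batch_first=True)
        self.word_lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True,
                                 bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, n_tags)

    def forward(self, chars):
        # chars: [N, NW, NC] -- batch of sentences, words, characters
        N, NW, NC = chars.size()
        c = self.char_emb(chars.view(N * NW, NC))    # [N*NW, NC, char_dim]
        _, (h, _) = self.char_lstm(c)                # h: [1, N*NW, word_dim]
        word_embs = h[-1].view(N, NW, -1)            # [N, NW, word_dim]
        states, _ = self.word_lstm(word_embs)        # [N, NW, 2*hidden_dim]
        return self.out(states)                      # [N, NW, n_tags]
```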

I am not asking about the model implementation. I am asking whether pytorch.text is flexible enough to prepare the data for such a problem.

As far as I can see, I cannot minibatch the inputs for the character LSTM and the inputs for the bi-LSTM at the same time. I guess batching sentences with the same number of WORDS seems more feasible, but I am also open to other suggestions. I would also appreciate it if you could share any similar code snippets.

Hi,

(This answer is very late, but I hope it will be useful for those who have the same problem.)

I think torchtext batches either words or character chunks, but not both at once, so you need to handle the character sequences manually, which takes just a few lines of code.

First: get batches at the word level using the torchtext iterator, which gives you [N, num_words].
Second: in each batch, pad every word to the length of the longest word in the current batch.

  • You need to split each word into its characters; torchtext can also do this, just create a dataset at the character level.

Then, find the length of the longest word in the batch.

For instance, let the input sentence be: [this is an example]
torchtext at the word level will give you: [4, 3, 1, 2] (the values are arbitrary)
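For example, with the (legacy, pre-0.9) torchtext Field/BucketIterator API, the word-level part could look roughly like this; the field names, toy data and settings are only for illustration:

```python
from torchtext.data import Field, Example, Dataset, BucketIterator

# Word-level fields; batch_first=True gives tensors shaped [N, num_words]
WORD = Field(batch_first=True, pad_token='<pad>')
TAG = Field(batch_first=True, unk_token=None)
fields = [('word', WORD), ('tag', TAG)]

# Toy dataset with a single pre-tokenised sentence and (made-up) tags
train = Dataset(
    [Example.fromlist([['this', 'is', 'an', 'example'],
                       ['DET', 'VERB', 'DET', 'NOUN']], fields)],
    fields)

WORD.build_vocab(train)
TAG.build_vocab(train)

train_iter = BucketIterator(train, batch_size=32,
                            sort_key=lambda ex: len(ex.word),
                            sort_within_batch=True)

for batch in train_iter:
    print(batch.word.shape)   # [N, NW]: word indices, padded per batch
    print(batch.tag.shape)    # [N, NW]
```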

So, in this batch, the longest word is “example”, which has 7 chars.
Your char vocabulary is ready, so just map each character to its corresponding index and, for the rest, put the padding token (<pad>).

If the word “this” is transformed as {t, h, i, s} = {22, 7, 8, 21} then you should pad it three times.

{t, h, i, s} ---- {t, h, i, s, <pad>, <pad>, <pad>}
{i, s} ---- {i, s, <pad>, <pad>, <pad>, <pad>, <pad>}
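In code, that per-batch character padding really is just a few lines. Here is a minimal sketch, assuming a char_to_ix dict with index 0 reserved for the padding token:

```python
# Words of one sentence in the current batch
words = ['this', 'is', 'an', 'example']
# Toy char vocabulary built on the fly; index 0 is reserved for <pad>
char_to_ix = {c: i + 1 for i, c in enumerate(sorted(set(''.join(words))))}

def pad_chars(words, char_to_ix, pad_ix=0):
    # Map each word to character indices and pad every word up to the
    # length of the longest word in this batch (7, for "example").
    max_len = max(len(w) for w in words)
    return [[char_to_ix[c] for c in w] + [pad_ix] * (max_len - len(w))
            for w in words]

print(pad_chars(words, char_to_ix))
# e.g. [[t, h, i, s, 0, 0, 0], [i, s, 0, 0, 0, 0, 0], ...] as index lists
```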

For a word batch of size [N, NW] you should get character batches of size [N, NW, NC]
(NW = number of words, NC = number of characters).
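Putting the two levels together, a minimal sketch of building that [N, NW, NC] tensor by hand (again assuming a char_to_ix dict and index 0 as the padding index) could be:

```python
import torch

def char_batch(sentences, char_to_ix, pad_ix=0):
    # sentences: list of N tokenised sentences (lists of words)
    # returns a LongTensor of shape [N, NW, NC]
    NW = max(len(s) for s in sentences)               # longest sentence, in words
    NC = max(len(w) for s in sentences for w in s)    # longest word, in chars
    batch = torch.full((len(sentences), NW, NC), pad_ix, dtype=torch.long)
    for i, sent in enumerate(sentences):
        for j, word in enumerate(sent):
            for k, ch in enumerate(word):
                batch[i, j, k] = char_to_ix[ch]
    return batch

sents = [['this', 'is', 'an', 'example'], ['another', 'one']]
char_to_ix = {c: i + 1 for i, c in
              enumerate(sorted(set(''.join(w for s in sents for w in s))))}
print(char_batch(sents, char_to_ix).shape)   # torch.Size([2, 4, 7])
```

This tensor can then be fed to the character LSTM, while the word-level [N, NW] batch (and the tags) come straight from the torchtext iterator.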