Is this the correct way to use build_vocab()? (torchtext)

Hi all, sorry for the basic question.
Is this the correct way to use build_vocab()?

import torch
from torchtext.data import Dataset, Example, Field
from torchtext.data import Iterator, BucketIterator

TEXT  = Field(sequential=True, tokenize=lambda x: x.split(), lower=True)
LABEL = Field(sequential=False, use_vocab=False)

data = [("The mountain is hight", "A"), ("Work is quite interesting", "B")]
fs = [('text', LABEL), ('category', TEXT)]
examples = list(map(lambda x: Example.fromlist(list(x), fields=fs), data))
dt = Dataset(examples, fields=fs)
TEXT.build_vocab(dt, vectors="glove.6B.100d")

print(len(TEXT.vocab))

for el in data:
    tokens = el[0].split()
    print(tokens)
    for t in tokens:
        print(TEXT.vocab.stoi[t])

I'm asking because in my real code (the snippet above is just a test) every token index comes out as 0, and the same happens in this example.

As far as I can tell, it builds the vocab from the second entry only (i.e. category), because fs pairs TEXT with 'category' and LABEL with 'text'. As a result, every token in text is unknown (unk), whose id is 0.
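
To make the effect concrete, here is a small pure-Python sketch that does not use torchtext at all. `toy_stoi` is a hypothetical stand-in for a built vocab (real torchtext orders tokens by frequency; this toy sorts them alphabetically), and it only illustrates what happens when the vocab is built from the wrong column.

```python
from collections import defaultdict

data = [("The mountain is hight", "A"), ("Work is quite interesting", "B")]

def toy_stoi(strings):
    """Toy stand-in for a built vocab: index 0 is <unk>, index 1 is <pad>."""
    tokens = sorted({tok for s in strings for tok in s.lower().split()})
    stoi = defaultdict(int)  # any missing token falls back to 0 (<unk>)
    stoi.update({tok: i for i, tok in enumerate(["<unk>", "<pad>"] + tokens)})
    return stoi

# Vocab built from the second column, which is what the swapped field list causes
from_category = toy_stoi(label for _, label in data)
# Vocab built from the first column, which is what was intended
from_text = toy_stoi(text for text, _ in data)

query = "the mountain is hight".split()
ids_wrong = [from_category[t] for t in query]  # every token is unknown
ids_right = [from_text[t] for t in query]      # every token has its own index

# Lookups must use the same lowercasing as the vocab, or "The" is unknown too
ids_cased = [from_text[t] for t in "The mountain is hight".split()]
```

The last line points at a second pitfall in the test loop: with lower=True the vocab stores "the", so looking up the raw token "The" would still return 0 even after the fields are paired correctly.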

You could write something similar to build_vocab_from_iterator. The constructor of the Vocab class needs a counter of tokens. Take a look at the Vocab class here
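
A minimal sketch of the "counter of tokens" idea, using only the standard library. `count_tokens` and its `min_freq` parameter are hypothetical names chosen for illustration, not torchtext functions; the resulting Counter is the kind of object the reply above says the Vocab constructor expects.

```python
from collections import Counter

def count_tokens(token_lists, min_freq=1):
    """Count tokens across an iterable of token lists, dropping rare ones."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    # Keep only tokens that occur at least min_freq times
    return Counter({tok: n for tok, n in counter.items() if n >= min_freq})

sentences = ["shop street mountain is hight", "work is interesting"]
# Same preprocessing as the TEXT field: lowercase, then split on whitespace
counter = count_tokens(s.lower().split() for s in sentences)
```

Because the function accepts any iterable, the sentences can be streamed from a file instead of being held in memory.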

Thanks!

I have now changed my code snippet to:

import torch
from torchtext.data import Dataset, Example, Field
from torchtext.data import Iterator, BucketIterator

TEXT  = Field(sequential=True, tokenize=lambda x: x.split(),
              lower=True, use_vocab=True)
# The labels are strings ("a", "b"), so they need a vocab to be mapped to integers;
# use_vocab=False only works when the labels are already numeric.
LABEL = Field(sequential=False, use_vocab=True)

data = [("shop street mountain is hight", "a"),
        ("work is interesting", "b")]

FIELDS = [('text', TEXT), ('category', LABEL)]

examples = [Example.fromlist(list(x), fields=FIELDS) for x in data]

dt = Dataset(examples, fields=FIELDS)

TEXT.build_vocab(dt, vectors="glove.6B.100d")
LABEL.build_vocab(dt)  # pretrained vectors are not needed for labels

print(TEXT.vocab.stoi["is"])

# sort_key receives an Example, which has no len(); use the token count instead
data_iter = Iterator(dt, batch_size=4, sort_key=lambda x: len(x.text))

But now I have a follow-up question.

How do I transform the text data in dt or data_iter into a numerical format that can be fed into the model?

I now have an iterator over dt, but it holds the text field as text, not as numerical torch tensors.

As I understand it, the TEXT field now contains the mapping from tokens to indices, but I need to use dt or data_iter as the input to the model.

Update: I have reposted this as a separate question: Creating input for the model from the raw text

Do you mean data_iter still yields text? As far as I remember, it numericalizes the text into tensors at some point, so data_iter should be ready to use for training the model.

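If not, you can numericalize the tokens yourself. TEXT.vocab.stoi is a mapping from token to index, so index it with square brackets (TEXT.vocab.stoi[tok]) rather than calling it, collect the ids in a list, and convert that list to a tensor with torch.tensor([tok_ids]). Built from Python ints, this gives an integer (long) tensor, which is what an embedding layer expects, whereas torch.Tensor would give floats.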
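
The manual route could be sketched as follows. The `stoi` dictionary here is a hypothetical stand-in for TEXT.vocab.stoi with made-up indices, and `numericalize` and `pad_batch` are illustrative helper names, not torchtext functions. The sketch stays in plain Python so that it runs without torchtext installed.

```python
from collections import defaultdict

# Hypothetical stand-in for TEXT.vocab.stoi: unknown tokens fall back to index 0
stoi = defaultdict(int, {"<unk>": 0, "<pad>": 1, "is": 2, "work": 3, "interesting": 4})

def numericalize(sentence, stoi):
    """Lowercase, split on whitespace, and look each token up in stoi."""
    return [stoi[tok] for tok in sentence.lower().split()]

def pad_batch(id_lists, pad_index):
    """Right-pad every id list to the length of the longest one."""
    max_len = max(len(ids) for ids in id_lists)
    return [ids + [pad_index] * (max_len - len(ids)) for ids in id_lists]

batch = pad_batch(
    [numericalize(s, stoi) for s in ["work is interesting", "Is mountain"]],
    pad_index=stoi["<pad>"],
)
```

With PyTorch available, torch.tensor(batch, dtype=torch.long) turns the padded list into a tensor of shape (batch size, sequence length).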
