Is this the correct way to use build_vocab()? (torchtext)

al314 · January 31, 2020, 8:06am

Thanks!

I have now changed my code snippet to:

import torch
from torchtext.data import Dataset, Example, Field
from torchtext.data import Iterator, BucketIterator

TEXT  = Field(sequential=True, tokenize=lambda x: x.split(), 
                       lower=True, use_vocab=True)
# Labels are strings ("a", "b"), so they need a vocab to be mapped to integers
LABEL = Field(sequential=False, use_vocab=True)

data = [("shop street mountain is hight", "a"), 
         ("work is interesting", "b")]

FIELDS = [('text', TEXT), ('category', LABEL)]

examples = list(map(lambda x: Example.fromlist(list(x), fields=FIELDS), 
                                 data))

dt = Dataset(examples, fields=FIELDS)

TEXT.build_vocab(dt, vectors="glove.6B.100d")
LABEL.build_vocab(dt)  # labels do not need pretrained word vectors

print(TEXT.vocab.stoi["is"])

# Sort by the number of tokens in the text field (an Example itself has no length)
data_iter = Iterator(dt, batch_size=4, sort_key=lambda x: len(x.text))

But now I have another question:

How do I transform the text data in dt or data_iter into a numerical format suitable for feeding into the model?

I now have an iterator over dt, but as far as I can see it holds the text field as text, not as numerical torch tensors.

As I understand it, the TEXT field now contains the mapping from tokens to indices (and vectors), but I need to use dt or data_iter as the input to the model.
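
To make the target format concrete, here is a minimal plain-Python sketch of the transformation being asked about: build a token-to-index mapping, then turn each tokenized sentence into a padded list of indices. It does not use torchtext. The helper names `build_stoi` and `numericalize` are hypothetical, and the choice of `<unk>` = 0, `<pad>` = 1 with frequency-then-alphabetical ordering is an assumption made for illustration.

```python
from collections import Counter

def build_stoi(token_lists, specials=("<unk>", "<pad>")):
    """Hypothetical helper: map each token to an integer index.

    Special tokens come first; remaining tokens are ordered by
    descending frequency, with ties broken alphabetically.
    """
    counts = Counter(tok for toks in token_lists for tok in toks)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    return {tok: idx for idx, tok in enumerate(itos)}

def numericalize(token_lists, stoi, pad="<pad>", unk="<unk>"):
    """Hypothetical helper: convert token lists to equal-length index lists."""
    max_len = max(len(toks) for toks in token_lists)
    batch = []
    for toks in token_lists:
        ids = [stoi.get(tok, stoi[unk]) for tok in toks]
        ids += [stoi[pad]] * (max_len - len(toks))  # right-pad short sentences
        batch.append(ids)
    return batch

# Same two sentences as in the snippet above, tokenized the same way
sentences = ["shop street mountain is hight", "work is interesting"]
tokens = [s.lower().split() for s in sentences]

stoi = build_stoi(tokens)
batch = numericalize(tokens, stoi)
print(batch)
```

The nested list `batch` has shape (batch size, sequence length) and could be wrapped with `torch.tensor(batch)` to get an integer tensor. As far as I can tell, the legacy torchtext `Iterator` performs an equivalent step when it is iterated over, so the `text` attribute of each yielded batch should already be a tensor of indices, laid out as (sequence length, batch size) unless the field is created with `batch_first=True`.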

Update: I have reposted this as a separate question: Creating input for the model from the raw text