Creating input for the model from the raw text

Hi all,

In the official documentation tutorial on text classification

the following dataset is used:

train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](root='/.pytorch_datasets', ngrams=2, vocab=None)

whose elements are already numerical tensors, perfectly suitable for feeding into a neural net, e.g.


tensor([572, 564, 2, 2326, ...])

How do I prepare the same input from the raw text?

So far I have managed to convert only single words into vectors (thanks for the help here):

import torch
import torch.nn as nn
from torchtext.data import Dataset, Example, Field
from torchtext.data import Iterator, BucketIterator

TEXT  = Field(sequential=True, tokenize=lambda x: x.split(), lower=True, use_vocab=True)
LABEL = Field(sequential=False, use_vocab=False)

data = [("shop street mountain is high", "a"), 
         ("work is interesting", "b")]

FIELDS = [('text', TEXT), ('category', LABEL)]

examples = list(map(lambda x: Example.fromlist(list(x), fields=FIELDS), data))

dt = Dataset(examples, fields=FIELDS)

TEXT.build_vocab(dt, vectors="glove.6B.100d")
LABEL.build_vocab(dt)  # labels are not words, so no pretrained vectors here


data_iter = Iterator(dt, batch_size=4, sort_key=lambda x: len(x.text))

VOCAB_SIZE = len(TEXT.vocab)
embedding = nn.Embedding(VOCAB_SIZE, 32)


But this handles only single words, while in my raw text there are several sentences per label.

Can you please advise me on how to convert variable-length pieces of text into tensors of the same dimensions, like those in the built-in text_classification.DATASETS['AG_NEWS']?
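For concreteness, here is roughly the kind of mapping I mean, as a plain-Python sketch (my own toy helpers, not torchtext API): build a vocabulary of unigrams and bigrams, then map a whole sentence to a tensor of indices.

```python
import torch

def build_vocab(texts):
    # Map each unique token (unigrams plus bigrams, since ngrams=2) to an index.
    vocab = {}
    for text in texts:
        tokens = text.lower().split()
        bigrams = [" ".join(p) for p in zip(tokens, tokens[1:])]
        for tok in tokens + bigrams:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

def text_to_tensor(text, vocab):
    # Numericalize one piece of text: unigram indices followed by bigram indices.
    tokens = text.lower().split()
    bigrams = [" ".join(p) for p in zip(tokens, tokens[1:])]
    return torch.tensor([vocab[t] for t in tokens + bigrams if t in vocab])

texts = ["shop street mountain is high", "work is interesting"]
vocab = build_vocab(texts)
print(text_to_tensor(texts[1], vocab))  # a 1-D tensor of token and bigram indices
```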

Hi Alex,
Please follow this tutorial:


Or, you could load your data with the new torchtext abstraction. The text classification datasets you mention follow that same abstraction, so it should be straightforward to copy/paste it and write your own pipeline (link).
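As for getting equal-sized tensors from variable-length examples: a minimal sketch using `pad_sequence` from core PyTorch, which pads a list of index tensors to the length of the longest one (the index values below are made up):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three "numericalized" texts of different lengths (made-up indices).
seqs = [torch.tensor([572, 564, 2]),
        torch.tensor([2326, 7]),
        torch.tensor([1, 2, 3, 4, 5])]

# Pad with index 0 to the longest sequence, giving one rectangular batch tensor.
batch = pad_sequence(seqs, batch_first=True, padding_value=0)
print(batch.shape)  # torch.Size([3, 5])
```

The padded batch can then go straight into `nn.Embedding`; just reserve index 0 for padding (e.g. `nn.Embedding(..., padding_idx=0)`).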