Creating input for the model from the raw text

Hi all,

In the official documentation tutorial on text classification

https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html

the following dataset is used:

train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](
    root='/.pytorch_datasets', ngrams=2, vocab=None)

Its elements are already numerical tensors, ready to be fed into a neural net:

train_dataset[0]

(2,
 tensor([572, 564, 2, 2326,   ...]))

How do I prepare the same input from the raw text?

So far I have only managed to convert single words into vectors (thanks for the help here):

import torch
import torch.nn as nn
from torchtext.data import Dataset, Example, Field
from torchtext.data import Iterator, BucketIterator
# https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html

TEXT  = Field(sequential=True, tokenize=lambda x: x.split(), lower=True, use_vocab=True)
LABEL = Field(sequential=False, use_vocab=True)  # labels are strings, so they need a vocab to become integers

data = [("shop street mountain is high", "a"),
        ("work is interesting", "b")]

FIELDS = [('text', TEXT), ('category', LABEL)]

examples = list(map(lambda x: Example.fromlist(list(x), fields=FIELDS), data))

dt = Dataset(examples, fields=FIELDS)

TEXT.build_vocab(dt, vectors="glove.6B.100d")
LABEL.build_vocab(dt)  # labels do not need pretrained word vectors

print(TEXT.vocab.stoi["is"])

data_iter = Iterator(dt, batch_size=4, sort_key=lambda x: len(x.text))

VOCAB_SIZE = len(TEXT.vocab)
embedding = nn.Embedding(VOCAB_SIZE, 32)

print(embedding(torch.tensor(TEXT.vocab.stoi["is"])))

But this only works for single words, while in the raw text I have several sentences per label.

Can you please advise me on how to convert variable-length pieces of text into tensors of the same dimension, like those in the built-in dataset text_classification.DATASETS['AG_NEWS']?

Hi Alex,
Please follow this tutorial: https://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/
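
In case it helps to see the idea before reading the tutorial: as far as I know, a Field-based pipeline gets equally sized tensors by padding every sequence in a batch up to the longest one. Below is a minimal pure-Python sketch of that step. The helper name `pad_batch`, the choice of `pad_id=0` and the example ids are my own assumptions for illustration, not torchtext API.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every list of token ids to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


# Two already numericalized texts of different lengths (made-up ids).
batch = [[4, 7, 2, 9, 5], [3, 2, 8]]
padded = pad_batch(batch)
# Every row now has the same length, so torch.tensor(padded) would give
# a rectangular tensor of shape (batch_size, max_len).
```

The padding id should be an index that the model treats as "no token", for example by passing it as `padding_idx` to `nn.Embedding`.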

Or you could load your data with the new torchtext abstraction. The text classification datasets you mentioned follow that same abstraction, so it should be straightforward to copy it and write your own pipeline (link).
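
To make that a bit more concrete, here is a rough sketch of what such a pipeline does, written in plain Python so nothing depends on a particular torchtext version. It is not the library's code: the function names (`tokenize`, `add_ngrams`, `build_vocab`, `numericalize`, `collate`), the whitespace tokenizer and the integer labels are assumptions of mine. The idea is to tokenize, append bigrams, map each token to an integer id, and then batch by concatenating the ids and recording where each text starts, which is the layout `nn.EmbeddingBag` accepts.

```python
from collections import Counter


def tokenize(text):
    # Simplest possible tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def add_ngrams(tokens, n=2):
    """Return the tokens followed by all 2..n-grams, joined by spaces."""
    out = list(tokens)
    for size in range(2, n + 1):
        for i in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[i:i + size]))
    return out


def build_vocab(texts, n=2):
    """Map each token/n-gram to an integer id; id 0 is kept for unknowns."""
    counter = Counter()
    for text in texts:
        counter.update(add_ngrams(tokenize(text), n))
    # Most frequent first, ties broken alphabetically for reproducibility.
    itos = ["<unk>"] + sorted(counter, key=lambda tok: (-counter[tok], tok))
    return {tok: idx for idx, tok in enumerate(itos)}


def numericalize(text, stoi, n=2):
    return [stoi.get(tok, 0) for tok in add_ngrams(tokenize(text), n)]


def collate(batch):
    """Concatenate variable-length id lists and record each start offset."""
    labels, ids, offsets = [], [], []
    for label, seq in batch:
        labels.append(label)
        offsets.append(len(ids))
        ids.extend(seq)
    return ids, offsets, labels


raw = [("shop street mountain is high", 0), ("work is interesting", 1)]
stoi = build_vocab([text for text, _ in raw])

# Same (label, ids) layout as train_dataset[0]; wrap ids in torch.tensor(...)
# to get actual tensors.
dataset = [(label, numericalize(text, stoi)) for text, label in raw]
ids, offsets, labels = collate(dataset)
```

With this layout the texts never need to have the same length: `ids` and `offsets` can be turned into tensors and passed to an `nn.EmbeddingBag`, which produces one fixed-size vector per text.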