Hi all,
In the official text classification tutorial
https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html
the following dataset is used:
train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](root='/.pytorch_datasets', ngrams=2, vocab=None)
Its elements are already numerical tensors, ready to be fed into a neural network:
train_dataset[0]
(2,
tensor([572, 564, 2, 2326, ...]))
How do I prepare the same input from the raw text?
So far I have only managed to convert single words into vectors (thanks for the help here):
import torch
import torch.nn as nn
from torchtext.data import Dataset, Example, Field
from torchtext.data import Iterator, BucketIterator
# https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html
TEXT = Field(sequential=True, tokenize=lambda x: x.split(), lower=True, use_vocab=True)
# The labels below are strings ("a", "b"), so they need a vocab to be mapped to integers.
LABEL = Field(sequential=False, use_vocab=True)
data = [("shop street mountain is hight", "a"),
("work is interesting", "b")]
FIELDS = [('text', TEXT), ('category', LABEL)]
examples = list(map(lambda x: Example.fromlist(list(x), fields=FIELDS), data))
dt = Dataset(examples, fields=FIELDS)
TEXT.build_vocab(dt, vectors="glove.6B.100d")
# Labels are class names, not words, so no pretrained vectors are needed for them.
LABEL.build_vocab(dt)
print(TEXT.vocab.stoi["is"])
# Sort by the number of tokens in the text field; an Example itself has no length.
data_iter = Iterator(dt, batch_size=4, sort_key=lambda x: len(x.text))
VOCAB_SIZE = len(TEXT.vocab)
embedding = nn.Embedding(VOCAB_SIZE, 32)
print(embedding(torch.tensor(TEXT.vocab.stoi["is"])))
But this only handles single words, whereas my raw text has several sentences per label.
Can you please advise me on how to convert variable-length pieces of text into index tensors of the same form as the ones in the built-in dataset text_classification.DATASETS['AG_NEWS']?
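For concreteness, here is a minimal sketch of the kind of conversion being asked about, written in plain Python and PyTorch rather than torchtext. It is not the built-in dataset's actual implementation. The helper names (tokenize, add_bigrams, build_vocab, text_to_tensor, collate) and the label_to_idx mapping are hypothetical, and whitespace tokenization with index 0 reserved for unknown tokens is an assumption. Each text becomes a variable-length 1-D tensor of unigram and bigram ids; a batch is then flattened into one tensor plus offsets, the input form nn.EmbeddingBag accepts, so every text ends up as a vector of the same size.

```python
from collections import Counter

import torch
import torch.nn as nn


def tokenize(text):
    # Lowercase and split on whitespace (a deliberately simple tokenizer).
    return text.lower().split()


def add_bigrams(tokens):
    # Unigrams followed by space-joined bigrams, e.g. ["work", "is", "work is"].
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def build_vocab(texts, unk="<unk>"):
    # Index 0 is reserved for unknown tokens; the rest are ordered by frequency.
    counter = Counter(tok for t in texts for tok in add_bigrams(tokenize(t)))
    itos = [unk] + [tok for tok, _ in counter.most_common()]
    return {tok: i for i, tok in enumerate(itos)}


def text_to_tensor(text, stoi):
    # One variable-length 1-D tensor of token ids per text.
    ids = [stoi.get(tok, 0) for tok in add_bigrams(tokenize(text))]
    return torch.tensor(ids, dtype=torch.long)


def collate(batch):
    # Concatenate all texts into one flat tensor and record where each starts.
    labels = torch.tensor([label for label, _ in batch], dtype=torch.long)
    lengths = [len(ids) for _, ids in batch]
    offsets = torch.tensor([0] + lengths[:-1]).cumsum(dim=0)
    token_ids = torch.cat([ids for _, ids in batch])
    return labels, token_ids, offsets


data = [("shop street mountain is hight", "a"),
        ("work is interesting", "b")]
label_to_idx = {"a": 0, "b": 1}  # hypothetical label mapping

stoi = build_vocab(text for text, _ in data)
dataset = [(label_to_idx[label], text_to_tensor(text, stoi))
           for text, label in data]

labels, token_ids, offsets = collate(dataset)
embedding = nn.EmbeddingBag(len(stoi), 32)  # default mode averages each bag
pooled = embedding(token_ids, offsets)
print(pooled.shape)  # torch.Size([2, 32])
```

Here dataset[0] has the same (label, tensor of ids) shape as train_dataset[0] above, and collate could be passed as collate_fn to a torch.utils.data.DataLoader. Padding every text to a common length is another option if the model needs a per-token sequence instead of a pooled vector.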