Hi all,
I am going through the tutorial: https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html
And the next dataset is used as input there:
train_dataset, test_dataset = \
text_classification.DATASETS['AG_NEWS'](root='./.data', ngrams=NGRAMS, vocab=None)
While it downloads the data as *.csv files, the function returns tensors with numerical values:
train_dataset[0]
(2, tensor([ 572, 564, 2, 2326, 49106, 150, 88, 3,
1143, 14, 32, 15, 32, 16, 443749, 4,
572, 499, 17, 10, 741769, 7, 468770, 4,
52, 7019, 1050, 442, 2, 14341, 673, 141447,
326092, 55044, 7887, 411, 9870, 628642, 43, 44,
144, 145, 299709, 443750, 51274, 703, 14312, 23,
1111134, 741770, 411508, 468771, 3779, 86384, 135944, 371666,
4052]))
The model itself is described in detail, but how was the input text from the *.csv files translated into numerical vectors?
Could you please give me a hint about the most common way to translate text data in a training set into numerical tensors? I know that one common way is to one-hot encode words, but what happens if the unseen data contains words that never appeared during training? I also didn't see this covered in the tutorials; in most cases preprocessed data is just downloaded via a library function, and the input is already in the form of numerical vectors. How is this usually done in pytorch/torchtext?
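For context, my current guess is that it works roughly like the sketch below: build a vocabulary from the training text, assign each word an integer index, and map any out-of-vocabulary word to a reserved unknown token. This is just my own illustration (the `<unk>` index and helper functions are my invention, not the actual torchtext code), so please correct me if the real pipeline differs:

```python
# My own minimal sketch of word-to-index encoding, NOT the torchtext implementation.
# Unknown words at inference time are mapped to a reserved <unk> index.
from collections import Counter

UNK_IDX = 0  # reserved index for words not seen during training

def build_vocab(texts):
    """Count tokens in the training texts and assign each a unique index."""
    counter = Counter(tok for text in texts for tok in text.lower().split())
    # index 0 is reserved for <unk>; known words start at 1
    return {tok: i + 1 for i, (tok, _) in enumerate(counter.most_common())}

def encode(text, vocab):
    """Translate a text into a list of integer indices."""
    return [vocab.get(tok, UNK_IDX) for tok in text.lower().split()]

train_texts = ["wall st bears claw back", "carlyle looks toward commercial"]
vocab = build_vocab(train_texts)
# "bulls" and "charge" were never seen in training, so they become UNK_IDX
print(encode("wall st bulls charge", vocab))
```

Is this essentially what `text_classification.DATASETS['AG_NEWS']` does under the hood (plus the ngram handling)?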
Thanks.