How is input text preprocessed in the tutorials?

Hi all,

I am going through the tutorial:

The following dataset is used as input there:

train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](
    root='./.data', ngrams=NGRAMS, vocab=None)

Although it downloads the data as *.csv files, the function returns tensors of numerical values:

(2, tensor([    572,     564,       2,    2326,   49106,     150,      88,       3,
            1143,      14,      32,      15,      32,      16,  443749,       4,
             572,     499,      17,      10,  741769,       7,  468770,       4,
              52,    7019,    1050,     442,       2,   14341,     673,  141447,
          326092,   55044,    7887,     411,    9870,  628642,      43,      44,
             144,     145,  299709,  443750,   51274,     703,   14312,      23,
         1111134,  741770,  411508,  468771,    3779,   86384,  135944,  371666,

The model itself is described in detail, but how was the input text from the *.csv files translated into numerical vectors?

Could you please give me a hint about the most common way to translate the text in a training set into numerical tensors? I know that one common way is to one-hot encode the words, but what happens if unseen data contains unseen words? I also didn't find this covered in the tutorials: in most cases the preprocessed data is simply downloaded with a library function, and the input is already in the form of numerical vectors. How is this usually done in pytorch/torchtext?


torchtext is the go-to library for dealing with text data. You define Fields, which describe how you want your text to be processed (tokenization etc.), and with the Field's vocabulary-building method (build_vocab) you can build a mapping from text to integers.
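To make the "mapping from text to integers" idea concrete, here is a minimal plain-Python sketch. It is not torchtext's implementation; the helper names (tokenize, build_vocab, numericalize) and the toy sentences are made up for illustration, and the tokenizer is just a whitespace split.

```python
from collections import Counter

def tokenize(text):
    # Simplest possible tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def build_vocab(texts, min_freq=1):
    # Count tokens over the training texts only.
    counts = Counter(tok for text in texts for tok in tokenize(text))
    # Index 0 is reserved for out-of-vocabulary words.
    itos = ["<unk>"] + [tok for tok, c in counts.most_common() if c >= min_freq]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def numericalize(text, stoi):
    # Words missing from the vocabulary fall back to the <unk> index.
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokenize(text)]

train_texts = ["wall street rallies", "oil prices fall", "wall street falls"]
stoi, itos = build_vocab(train_texts)
ids = numericalize("wall street crashes", stoi)
print(ids)
```

The word "crashes" never appeared in the training texts, so it gets the reserved index instead of raising an error. The list of integers is what would then be wrapped in a tensor.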

One-hot encoding is one approach to converting text data into numbers. With this approach, if you encounter a word that you did not see while fitting on your training data, it is typically encoded as all zeros.
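A small sketch of that behaviour, assuming a toy three-word vocabulary (the one_hot helper and the words are hypothetical, and real encoders may handle unknown words differently depending on their settings):

```python
def one_hot(word, stoi):
    # Start from an all-zero vector with one slot per known word.
    vec = [0] * len(stoi)
    idx = stoi.get(word)
    if idx is not None:
        vec[idx] = 1  # known word: set its slot to 1
    return vec        # unknown word: the vector stays all zeros

stoi = {"oil": 0, "prices": 1, "fall": 2}
seen = one_hot("prices", stoi)
unseen = one_hot("bitcoin", stoi)
print(seen, unseen)
```

Note that the vector length equals the vocabulary size, which is one reason this gets impractical for large vocabularies.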

However, word embeddings are more advanced than the one-hot encoding approach. torchtext's Field class lets you specify special tokens (the unk_token for out-of-vocabulary words, the pad_token for padding, the eos_token for the end of a sentence, and an optional init_token for the start of the sentence).
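Roughly, the special tokens play the following roles when a batch is turned into integer ids. This is a plain-Python sketch and not the Field class itself; the token strings, the helper names, and the toy data are assumptions chosen for illustration.

```python
UNK, PAD, INIT, EOS = "<unk>", "<pad>", "<sos>", "<eos>"

def build_vocab(token_lists):
    # Special tokens get the first indices, then words in order of appearance.
    itos = [UNK, PAD, INIT, EOS]
    for tokens in token_lists:
        for tok in tokens:
            if tok not in itos:
                itos.append(tok)
    return {tok: i for i, tok in enumerate(itos)}

def encode_batch(token_lists, stoi):
    # Wrap every sentence in start/end markers.
    wrapped = [[INIT] + toks + [EOS] for toks in token_lists]
    max_len = max(len(seq) for seq in wrapped)
    batch = []
    for seq in wrapped:
        # Pad shorter sentences so all rows have the same length.
        seq = seq + [PAD] * (max_len - len(seq))
        # Unknown words map to the <unk> index.
        batch.append([stoi.get(tok, stoi[UNK]) for tok in seq])
    return batch

train = [["oil", "prices", "fall"], ["stocks", "rally"]]
stoi = build_vocab(train)
batch = encode_batch([["oil", "prices", "soar"], ["stocks"]], stoi)
print(batch)
```

Rows of equal length like these can be stacked into a tensor and passed to an embedding layer such as torch.nn.Embedding, which looks up one dense vector per integer id.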

Please go through this tutorial, which gives a deeper understanding of how text data can be handled with torchtext.

Hope this answers your question.


@sai_m gave a good explanation of the preprocessing for datasets in torchtext. Just one point to add: the text classification datasets in torchtext use a new abstraction that no longer relies on Field. Take a look at the code here
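As a rough idea of what such a Field-free pipeline does, here is a plain-Python sketch that turns (label, text) rows into (label, ids) pairs, with n-grams added as extra tokens. It is not the library's actual code; the function names and the toy rows standing in for CSV rows are hypothetical.

```python
def add_ngrams(tokens, n):
    # Keep the single words, then append every 2..n-gram joined by a space.
    out = list(tokens)
    for size in range(2, n + 1):
        for i in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[i:i + size]))
    return out

def build_vocab(rows, n):
    stoi = {"<unk>": 0}  # index 0 reserved for unseen tokens
    for _, text in rows:
        for tok in add_ngrams(text.lower().split(), n):
            stoi.setdefault(tok, len(stoi))
    return stoi

def encode(rows, stoi, n):
    # Each example becomes (label, list of token and n-gram ids).
    return [
        (label, [stoi.get(tok, 0) for tok in add_ngrams(text.lower().split(), n)])
        for label, text in rows
    ]

rows = [(2, "oil prices fall"), (3, "stocks rally")]
stoi = build_vocab(rows, n=2)
data = encode(rows, stoi, n=2)
print(data)
```

Each output pair has the same shape as the (label, tensor) pair shown in the question. Because every distinct bigram also gets its own id, the vocabulary grows quickly once n-grams are enabled.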
