I am following one of the torchtext tutorials using the WikiText2 dataset, and I cannot figure out why the init and eos tokens still show up in my vocabulary even though I specify None as their value; padding tokens show up as well.
torchtext version: 0.4.0
This is part of my vocabulary:
{'<unk>': 0,
'<pad>': 1,
'the': 2,
',': 3,
'.': 4,
'of': 5,
'and': 6,
'in': 7,
'to': 8,
'<eos>': 9,
'a': 10,
'was': 11,
And this is my field setup:
import torchtext
from torchtext.data.utils import get_tokenizer

TEXT = torchtext.data.Field(tokenize=get_tokenizer("basic_english"),
                            init_token=None,
                            eos_token=None,
                            stop_words=["=", "==", "@", "@@", "<", ">", "@-@"],
                            lower=True)
train_txt, val_txt, test_txt = torchtext.datasets.WikiText2.splits(TEXT)
TEXT.build_vocab(train_txt)
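
In case it is relevant, here is a minimal check I can run to see whether the <eos> tokens come from the dataset itself rather than from the Field (a sketch, assuming torchtext 0.4.0's LanguageModelingDataset, which as far as I understand stores the whole tokenized corpus as one token list on a single example):

# inspect the raw token stream the dataset produced, before any batching
tokens = train_txt.examples[0].text
print(tokens[:20])            # peek at the first few tokens
print(tokens.count('<eos>'))  # count how often '<eos>' appears in the data itself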
I also get a <pad> token somewhere, which I cannot really explain, since padding is not enabled by default but only when specifying fix_length.
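
To narrow down where the <pad> entry comes from, one thing I can print is the Field's own special-token attributes (a minimal sketch, assuming the 0.4.0 Field exposes pad_token and unk_token attributes):

# the Field carries default specials even when I never request padding
print(TEXT.pad_token, TEXT.unk_token)  # defaults are '<pad>' and '<unk>'
print(TEXT.vocab.itos[:4])             # specials sit at the front of the vocab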