I am following one of the torchtext tutorials using the WikiText2 dataset, and I cannot figure out why the init and eos tokens still show up in my vocabulary even though I specify None as their value; padding tokens show up as well.
torchtext version: 0.4.0
This is part of my vocabulary:
{'<unk>': 0,
'<pad>': 1,
'the': 2,
',': 3,
'.': 4,
'of': 5,
'and': 6,
'in': 7,
'to': 8,
'<eos>': 9,
'a': 10,
'was': 11,
And this is my field setup:
import torchtext
from torchtext.data.utils import get_tokenizer

TEXT = torchtext.data.Field(tokenize=get_tokenizer("basic_english"),
                            init_token=None,
                            eos_token=None,
                            stop_words=["=", "==", "@", "@@", "<", ">", "@-@"],
                            lower=True)
train_txt, val_txt, test_txt = torchtext.datasets.WikiText2.splits(TEXT)
TEXT.build_vocab(train_txt)
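
In case it is relevant, here is a minimal check I can run to see whether the <eos> tokens come from the dataset itself rather than from the Field (a sketch, assuming torchtext 0.4.0's LanguageModelingDataset, which as far as I understand stores the whole tokenized corpus as one token list on a single example):

# inspect the raw token stream the dataset produced, before any batching
tokens = train_txt.examples[0].text
print(tokens[:20])            # peek at the first few tokens
print(tokens.count('<eos>'))  # count how often '<eos>' appears in the data itself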
I also get a <pad> token somewhere, which I cannot really explain, since padding is not enabled by default but only when specifying fix_length.
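
To narrow down where the <pad> entry comes from, one thing I can print is the Field's own special-token attributes (a minimal sketch, assuming the 0.4.0 Field exposes pad_token and unk_token attributes):

# the Field carries default specials even when I never request padding
print(TEXT.pad_token, TEXT.unk_token)  # defaults are '<pad>' and '<unk>'
print(TEXT.vocab.itos[:4])             # specials sit at the front of the vocab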