I am attempting seq2seq translation and want to learn how to use torchtext.
I am trying to load some text from a file into a single torchtext dataset and then create an iterator for it, but I only get one big batch back when I iterate over it.
    import spacy
    from torchtext import data
    from torchtext.datasets import LanguageModelingDataset

    spacy_fr = spacy.load('fr')

    def tokenize_fr(text):
        # Split a French string into a list of token strings using spaCy.
        return [tok.text for tok in spacy_fr.tokenizer(text)]

    FR_TEXT = data.Field(lower=True, tokenize=tokenize_fr, init_token='<sos>', eos_token='<eos>')
    fm = LanguageModelingDataset("french.txt", FR_TEXT)
    FR_TEXT.build_vocab(fm)
    fm_iter = data.Iterator(fm, batch_size=10)
    iteration = next(iter(fm_iter))
And when I run print(iteration) I get:
[torch.cuda.LongTensor of size 1462273x1 (GPU 0)]
(Additional info: the “french.txt” file contains 155,000 lines of text separated by newline characters.)
My understanding is that I should have 10 batches, but I only get one long one. I have been racking my brain over this for ages and examining the source code. Can anyone tell me what I am doing wrong?
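One possible reading of the 1462273x1 shape, offered as a sketch rather than a confirmed diagnosis: if the language-modeling dataset concatenates every line of the file into a single example, then an iterator that batches whole examples has only one item to hand out, whatever batch_size is. The pure-Python toy below illustrates that arithmetic. It does not call torchtext, and `batch_examples`, `lines`, `lm_style` and `per_line` are hypothetical names made up for the illustration.

```python
from typing import Iterator, List, Sequence


def batch_examples(
    examples: Sequence[List[str]], batch_size: int
) -> Iterator[List[List[str]]]:
    """Group whole examples into batches of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield list(examples[start:start + batch_size])


# Hypothetical stand-in for french.txt: three short lines instead of 155000.
lines = ["le chat dort", "il pleut", "bonjour tout le monde"]

# Assumption: the language-modeling dataset joins all lines into ONE example,
# appending an end-of-sentence token after each line.
single_example = [tok for line in lines for tok in line.split() + ["<eos>"]]
lm_style = [single_example]

# Alternative layout: one example per line, as sentence-level data would be stored.
per_line = [line.split() for line in lines]

lm_batches = list(batch_examples(lm_style, batch_size=10))
line_batches = list(batch_examples(per_line, batch_size=2))

# One batch holding one long example, versus several batches of short examples.
print(len(lm_batches), len(lm_batches[0]), len(lm_batches[0][0]))
print(len(line_batches))
```

If that reading is right, the number of batches depends on how many examples the dataset holds, not on the number of lines in the file. For sentence-level seq2seq data, a dataset that keeps one example per line would be the layout to look for. For a continuous token stream, torchtext's BPTTIterator slices the long sequence into fixed-length chunks instead.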