Hi everyone,
I am attempting to do seq2seq translation and want to learn how to use torchtext.
I am trying to load some text from a file into a single torchtext dataset and then create an iterator for it, but am only getting one big batch returned when I try to iterate.
My code:
import spacy
from torchtext import data
from torchtext.datasets import LanguageModelingDataset

spacy_fr = spacy.load('fr')

def tokenize_fr(text):
    # Split raw French text into a list of token strings
    return [tok.text for tok in spacy_fr.tokenizer(text)]

FR_TEXT = data.Field(lower=True, tokenize=tokenize_fr, init_token='<sos>', eos_token='<eos>')
fm = LanguageModelingDataset("french.txt", FR_TEXT)
FR_TEXT.build_vocab(fm)
fm_iter = data.Iterator(fm, batch_size=10)
iteration = next(iter(fm_iter))
And when I run print(iteration) I get:
Variable containing:
2
131
40
⋮
4
3
3
[torch.cuda.LongTensor of size 1462273x1 (GPU 0)]
(additional info: The “french.txt” file contains 155000 lines of text separated by newline characters.)
My understanding is that I should have 10 batches, but I only get one long one. I have been racking my brain over this for ages and examining the source code. Can anyone tell me what I am doing wrong?
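One possible explanation, sketched in plain Python rather than torchtext: as far as I can tell from the source, LanguageModelingDataset stores the whole file as a single example, so an iterator that batches examples would have only one example to hand out. That would match the 1462273x1 shape above. The helper batch_examples and the toy corpus below are hypothetical and only illustrate the idea; they are not torchtext's implementation.

```python
# Plain-Python sketch (no torchtext): batching groups *examples*, not tokens.

def batch_examples(examples, batch_size):
    """Split a list of examples into consecutive batches of at most batch_size."""
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]

# Hypothetical toy corpus standing in for french.txt (one sentence per line).
lines = ["le chat dort", "il pleut", "bonjour tout le monde"]

# Case 1: the whole file becomes ONE example (all tokens concatenated).
single_example = [[tok for line in lines for tok in line.split() + ["<eos>"]]]

# Case 2: one example per line.
per_line_examples = [line.split() for line in lines]

one_big = batch_examples(single_example, batch_size=2)
per_line = batch_examples(per_line_examples, batch_size=2)

print(len(one_big), len(one_big[0][0]))  # 1 batch holding one 12-token example
print(len(per_line))                     # 2 batches (2 lines + 1 line)
```

If that reading is right, an iterator intended for language-modeling data (torchtext has data.BPTTIterator) or a dataset with one example per line (for instance data.TabularDataset) might be closer to what seq2seq needs, though I have not confirmed this for my setup.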
Thank you!