Can't create batches with torchtext for language modelling

sammo · August 10, 2018, 9:33am

Hi everyone,

I am attempting seq2seq translation and want to learn how to use torchtext.

I am trying to load some text from a file into a single torchtext dataset and then create an iterator for it, but I only get one big batch back when I iterate.

My code:

import spacy
from torchtext import data
from torchtext.datasets import LanguageModelingDataset

spacy_fr = spacy.load('fr')

def tokenize_fr(text):
    return [tok.text for tok in spacy_fr.tokenizer(text)]

FR_TEXT = data.Field(lower=True, tokenize=tokenize_fr, init_token='<sos>', eos_token='<eos>')

fm = LanguageModelingDataset("french.txt", FR_TEXT)
FR_TEXT.build_vocab(fm)
fm_iter = data.Iterator(fm, batch_size=10)
iteration = next(iter(fm_iter))

And when I run print(iteration) I get:

Variable containing:
2
131
40
⋮
4
3
3
[torch.cuda.LongTensor of size 1462273x1 (GPU 0)]

(additional info: The “french.txt” file contains 155000 lines of text separated by newline characters.)

My understanding is that I should have 10 batches, but I only get one long one. I have been racking my brain over this for ages and examining the source code. Can anyone tell me what I am doing wrong?

Thank you!
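
A note for readers who hit the same 1462273x1 shape: as far as I can tell from the torchtext source, LanguageModelingDataset joins the whole file into a single example, so a plain data.Iterator has only one example to batch, whatever batch_size is. The iterator meant for this dataset appears to be data.BPTTIterator, which takes a bptt_len argument and cuts the token stream into fixed-length windows. The sketch below is plain Python and not torchtext code. The helper names num_batches and bptt_batches are made up for illustration, and the remainder handling is a simplification (it drops leftover tokens and does not build shifted targets).

```python
import math


def num_batches(num_examples, batch_size):
    # An example-based iterator yields ceil(num_examples / batch_size) batches.
    return math.ceil(num_examples / batch_size)


def bptt_batches(tokens, batch_size, bptt_len):
    """Yield windows of shape (seq_len, batch_size) from one long token stream.

    Simplified illustration only: leftover tokens that do not fill a full
    column are dropped, and no target sequence is produced.
    """
    n_rows = len(tokens) // batch_size
    # Column j holds the j-th contiguous slice of the stream.
    columns = [tokens[j * n_rows:(j + 1) * n_rows] for j in range(batch_size)]
    for start in range(0, n_rows, bptt_len):
        stop = min(start + bptt_len, n_rows)
        # Each row is one time step across all columns.
        yield [[col[i] for col in columns] for i in range(start, stop)]


# One example in the dataset means one batch, regardless of batch_size.
single = num_batches(num_examples=1, batch_size=10)

# A toy stream of 20 token ids, 2 columns, windows of 3 time steps.
windows = list(bptt_batches(list(range(20)), batch_size=2, bptt_len=3))
```

Here single is 1, which matches the one long batch in the output above, while windows holds several short batches, each with batch_size columns.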