How to properly set up torchtext's BPTTIterator


(Bachr) #1

I’m trying to train a language model with torchtext, using BPTTIterator to get a batch iterator for training. However, the iterator I get back is empty (i.e. it has length 0).

import spacy
import torchtext as tt

spacy_en = spacy.load('en')

def tokenizer(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

# embedding dimension (size of the word vectors)
embed_size = 300
# batch size
batch_size = 64
# BPTT (backpropagation through time) length
seq_len = 50

TEXT = tt.data.Field(tokenize=tokenizer, lower=True, batch_first=True)  # sequential=True

# PATH is the directory containing the CSV files
train, valid = tt.data.TabularDataset.splits(
    path=PATH, train='train.csv', validation='valid.csv',
    format='csv', fields=[('text', TEXT)])

TEXT.build_vocab(train, vectors='fasttext.en.300d')

vocab = TEXT.vocab

train_iter, valid_iter = tt.data.BPTTIterator.splits(
    (train, valid), batch_size=batch_size, bptt_len=30)

Now both len(train_iter) and len(valid_iter) give 0!
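
For context, here is a quick sanity check I can run (a sketch, assuming the legacy torchtext API; if I’m reading the legacy source correctly, BPTTIterator sizes itself from the token count of the dataset’s *first* example, and approx_len below is my reconstruction of that, not a library call):

import math

print(len(train))          # number of examples (one per CSV row)
print(len(train[0].text))  # tokens in the first example only

# Rough reconstruction of what I believe BPTTIterator.__len__ computes:
# it only looks at dataset[0].text, so a short first row can yield 0.
approx_len = math.ceil((len(train[0].text) / batch_size - 1) / 30)
print(approx_len)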

My dataset looks something like the following, and I’m trying to prepare it so I can pass it through an RNN:

A child in a pink dress is climbing up a set o...
A girl going into a wooden building .
A little girl climbing into a wooden playhouse .
A little girl climbing the stairs to her playh...
A little girl in a pink dress going into a woo...

What am I missing here? I’m closely following this official example, which works perfectly on Wikipedia datasets - link
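
For comparison, the official example builds its iterators roughly like this (my paraphrase from memory, assuming legacy torchtext; it uses the built-in WikiText2 language-modeling dataset, which is a single long stream of tokens, rather than a TabularDataset):

import torchtext as tt

TEXT = tt.data.Field(lower=True)
# WikiText2 is a LanguageModelingDataset: one example holding the whole corpus
train, valid, test = tt.datasets.WikiText2.splits(TEXT)
TEXT.build_vocab(train)
train_iter, valid_iter, test_iter = tt.data.BPTTIterator.splits(
    (train, valid, test), batch_size=64, bptt_len=30)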