How to properly set up torchtext's BPTTIterator


(Bachr) #1

I’m trying to train a language model with torchtext, using BPTTIterator to get a batch iterator for training. However, the iterator I get back is empty (i.e. it has length 0).

import spacy
import torchtext as tt

spacy_en = spacy.load('en')

def tokenizer(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

# embedding dimension (size of the word vectors)
embed_size = 300
# batch size
batch_size = 64
# BPTT (backpropagation through time) length
seq_len = 50

TEXT = tt.data.Field(tokenize=tokenizer, lower=True, batch_first=True)  # sequential=True

# PATH is the directory containing the CSV files
train, valid = tt.data.TabularDataset.splits(
    path=PATH, train='train.csv', validation='valid.csv',
    format='csv', fields=[('text', TEXT)])

TEXT.build_vocab(train, vectors='fasttext.en.300d')

vocab = TEXT.vocab

train_iter, valid_iter = tt.data.BPTTIterator.splits(
    (train, valid), batch_size=batch_size, bptt_len=30)

Now both len(train_iter) and len(valid_iter) give 0!
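
For context, here is a quick sanity check I can run (a sketch, assuming the legacy torchtext API; if I’m reading the legacy source correctly, BPTTIterator sizes itself from the token count of the dataset’s *first* example, and approx_len below is my reconstruction of that, not a library call):

import math

print(len(train))          # number of examples (one per CSV row)
print(len(train[0].text))  # tokens in the first example only

# Rough reconstruction of what I believe BPTTIterator.__len__ computes:
# it only looks at dataset[0].text, so a short first row can yield 0.
approx_len = math.ceil((len(train[0].text) / batch_size - 1) / 30)
print(approx_len)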

My dataset looks something like the following, and I’m trying to prepare it so I can pass it through an RNN:

A child in a pink dress is climbing up a set o...
A girl going into a wooden building .
A little girl climbing into a wooden playhouse .
A little girl climbing the stairs to her playh...
A little girl in a pink dress going into a woo...

What am I missing here? I’m closely following this official example, which works perfectly on Wikipedia datasets - link
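
For comparison, the official example builds its iterators roughly like this (my paraphrase from memory, assuming legacy torchtext; it uses the built-in WikiText2 language-modeling dataset, which is a single long stream of tokens, rather than a TabularDataset):

import torchtext as tt

TEXT = tt.data.Field(lower=True)
# WikiText2 is a LanguageModelingDataset: one example holding the whole corpus
train, valid, test = tt.datasets.WikiText2.splits(TEXT)
TEXT.build_vocab(train)
train_iter, valid_iter, test_iter = tt.data.BPTTIterator.splits(
    (train, valid, test), batch_size=64, bptt_len=30)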