How to properly set up torchtext BPTTIterator

(Bachr) #1

I’m trying to train a language model with torchtext, using BPTTIterator to get a batch iterator for training. However, I’m getting an empty one (i.e. length 0).

import spacy
from torchtext import data

spacy_en = spacy.load('en')

def tokenizer(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

# embedding dimension (size of the word vectors)
embed_size = 300
# batch size
batch_size = 64
# BPTT (backpropagation through time) length
seq_len = 50

TEXT = data.Field(lower=True, batch_first=True)  # sequential=True is the default

train, valid = data.TabularDataset.splits(path='...',  # path lost in the original post
                                          train='train.csv', validation='valid.csv',
                                          format='csv', fields=[('text', TEXT)])

TEXT.build_vocab(train, vectors='fasttext.en.300d')

vocab = TEXT.vocab

train_iter, valid_iter = data.BPTTIterator.splits((train, valid), batch_size=batch_size, bptt_len=30)

Now both len(train_iter) and len(valid_iter) give 0!
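For context on the symptom: BPTT iteration reshapes one long token stream into batch_size columns and then slices bptt_len time steps at a time, so a stream shorter than batch_size × bptt_len tokens produces zero full batches. A plain-Python sketch of that length calculation (not torchtext's actual implementation; the stream sizes are made up for illustration):

```python
def num_bptt_batches(stream_len, batch_size, bptt_len):
    # Tokens in each column after reshaping the stream into batch_size columns.
    tokens_per_column = stream_len // batch_size
    # Each batch consumes bptt_len time steps from every column.
    return tokens_per_column // bptt_len

# A 100k-token corpus fills many batches...
print(num_bptt_batches(100_000, batch_size=64, bptt_len=30))  # → 52
# ...but a stream of only a few tokens yields none.
print(num_bptt_batches(10, batch_size=64, bptt_len=30))  # → 0
```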

My dataset is something like the following, and I’m trying to prepare it to pass it through an RNN:

A child in a pink dress is climbing up a set o...
A girl going into a wooden building .
A little girl climbing into a wooden playhouse .
A little girl climbing the stairs to her playh...
A little girl in a pink dress going into a woo...

What am I missing here? I’m almost exactly following this official example, which works perfectly on Wikipedia datasets - link