I’m trying to train a language model with torchtext, using `BPTTIterator` to get a batch iterator for training. However, the iterator I get is empty (length 0).
```python
import spacy
import torchtext as tt

spacy_en = spacy.load('en')

def tokenizer(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

# embedding dimension (size of the word vectors)
embed_size = 300
# batch size
batch_size = 64
# BPTT (backpropagation through time) length
seq_len = 50

TEXT = tt.data.Field(tokenize=tokenizer, lower=True, batch_first=True)  # sequential=True

train, valid = tt.data.TabularDataset.splits(
    path=PATH, train='train.csv', validation='valid.csv',
    format='csv', fields=[('text', TEXT)])

TEXT.build_vocab(train, vectors='fasttext.en.300d')
vocab = TEXT.vocab

train_iter, valid_iter = tt.data.BPTTIterator.splits(
    (train, valid), batch_size=batch_size, bptt_len=30)
```
`len(valid_iter)` gives 0!
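Digging into the torchtext source, I think the length comes from treating the *first example's* `text` field as the whole corpus (`BPTTIterator` was written for `LanguageModelingDataset`, where the dataset is one long token stream). A sketch of my understanding of the length computation (the helper name is mine; the formula mirrors what I see in `BPTTIterator.__len__`):

```python
import math

def bptt_num_batches(num_tokens, batch_size, bptt_len):
    # Mirrors (as far as I can tell) torchtext's
    # ceil((len(dataset[0].text) / batch_size - 1) / bptt_len)
    return math.ceil((num_tokens / batch_size - 1) / bptt_len)

# One long corpus (Wikipedia-style): plenty of batches
print(bptt_num_batches(2_000_000, 64, 30))  # -> 1042

# A single short CSV row as dataset[0].text: zero batches
print(bptt_num_batches(12, 64, 30))         # -> 0
```

So with a `TabularDataset`, where `dataset[0].text` is just the tokens of the first CSV row, the count comes out 0, which matches what I'm seeing.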
My dataset looks like the following, and I’m trying to prepare it to pass through an RNN:
```
A child in a pink dress is climbing up a set o...
A girl going into a wooden building .
A little girl climbing into a wooden playhouse .
A little girl climbing the stairs to her playh...
A little girl in a pink dress going into a woo...
```
What am I missing here? I’m closely following this official example, which works perfectly on Wikipedia datasets: link