Processing in-memory text with torchtext

I have trained a model using torchtext for the data processing, e.g.:

from torchtext import data

# one Field per TSV column: the tokens and their labels
TEXT = data.Field()
LABELS = data.Field()

train, val, test = data.TabularDataset.splits(
    path='/data/pos_wsj/pos_wsj', train='_train.tsv',
    validation='_dev.tsv', test='_test.tsv', format='tsv',
    fields=[('text', TEXT), ('labels', LABELS)])

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_sizes=(16, 256, 256),
    sort_key=lambda x: len(x.text), device=0)

# build the vocabularies from the training split only
TEXT.build_vocab(train)
LABELS.build_vocab(train)

Now documents will arrive one at a time as in-memory Python strings. I would like to apply the same tokenization and numericalization to each document and then pass the result to my model for a prediction, e.g.:

new_doc = "hello world"
X = TEXT.process(new_doc)
pred = model(X)

This doesn’t work. Any ideas on how I can process this in-memory text?
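
For reference, here is a minimal sketch of one approach that should work with this Field API, continuing from the snippets above and assuming the default str.split tokenizer: Field.preprocess tokenizes a single raw string, and Field.process pads and numericalizes a batch (a list) of tokenized examples. The exact process signature varies between torchtext versions (older releases also take device and train arguments), so treat this as untested:

new_doc = "hello world"
# tokenize one raw string -> ['hello', 'world']
tokens = TEXT.preprocess(new_doc)
# numericalize a batch of one example -> LongTensor of shape (seq_len, 1)
X = TEXT.process([tokens])
pred = model(X)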
