Different Datasets Sharing One Vocabulary

Hi,

I want to load two text datasets (A and B) with torchtext, and I build a vocabulary on A using the following code.

from torchtext import data

# read data
TEXT = data.Field()
LABELS = data.Field(sequential=False)

train, val, test = data.TabularDataset.splits(path=args.data,
                                              train='train.csv',
                                              validation='valid.csv',
                                              test='test.csv',
                                              format='csv',
                                              fields=[('text', TEXT), ('label', LABELS)])
train_iter, val_iter, test_iter = data.BucketIterator.splits((train, val, test),
                                                         batch_sizes=(args.batch_size,
                                                                      4 * args.batch_size,
                                                                      4 * args.batch_size),
                                                         sort_key=lambda x: len(x.text),
                                                         device=0)
# build the vocabulary (and load word vectors) from dataset A's training split
TEXT.build_vocab(train.text, wv_type=args.wv_type, wv_dim=args.wv_dim)
LABELS.build_vocab(train.label)

I want to reuse the same vocabulary on B instead of rebuilding a new one.
Is there any solution for this in torchtext?

  • Can I dump the vocab in torchtext and load/assign it later (e.g. something like the sketch right after this list)?
  • Can I reuse the Field in torchtext?
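
Concretely, this is the kind of thing I have in mind for dump-and-load (just a sketch, not an official torchtext feature for this; I'm also not sure the Vocab object pickles cleanly in every torchtext version, since torch.save goes through pickle):

import torch
from torchtext import data

# after TEXT.build_vocab(...) has run on dataset A:
torch.save(TEXT.vocab, 'text_vocab.pt')      # dump the Vocab object to disk

# later, possibly in a separate script, set up the Field for dataset B
# (TEXT_B is just an illustrative name):
TEXT_B = data.Field()
TEXT_B.vocab = torch.load('text_vocab.pt')   # assign instead of calling build_vocab again

If the Vocab object refuses to pickle (some versions keep a lambda inside stoi), saving TEXT.vocab.freqs (a plain collections.Counter) and rebuilding a Vocab from it should be a workable fallback.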

Thanks

[bystander, newbie comment] Seems like Vocab is missing a .freeze() method? Or could you do something like TEXT.vocab.max_size = len(TEXT.vocab)?

I'm now doing something like the following. It works, but it's somewhat ugly.

An elegant solution is still wanted.

from torchtext import data

# read data
TEXT = data.Field()
LABELS = data.Field(sequential=False)
# use dataset A to build vocab
vocab_train, _, _ = data.TabularDataset.splits(path=args.vocab,
                                               train='train.csv',
                                               validation='valid.csv',
                                               test='test.csv',
                                               format='csv',
                                               fields=[('text', TEXT), ('label', LABELS)])
TEXT.build_vocab(vocab_train.text, wv_type=args.wv_type, wv_dim=args.wv_dim)
# reuse vocab for dataset B
train, val, test = data.TabularDataset.splits(path=args.data,
                                              train='train.csv',
                                              validation='valid.csv',
                                              test='test.csv',
                                              format='csv',
                                              fields=[('text', TEXT), ('label', LABELS)])
train_iter, val_iter, test_iter = data.BucketIterator.splits((train, val, test),
                                                             batch_sizes=(args.batch_size,
                                                                          4 * args.batch_size,
                                                                          4 * args.batch_size),
                                                             sort_key=lambda x: len(x.text),
                                                             device=0)
# the label vocab is still built from dataset B's training split
LABELS.build_vocab(train.label)

This line doesn't report an error. Do you mean building a Vocab first and then assigning it?

This build-and-assign idea seems to work; I still think the ideal solution would be dump-and-load.
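
To make sure I understand the build-and-assign route, is the idea roughly the following? (This assumes the legacy torchtext.vocab.Vocab constructor that accepts a collections.Counter; the names vocab_train and shared_vocab are just illustrative, and the word-vector arguments are left out.)

from collections import Counter
from torchtext import data, vocab

# count tokens over dataset A's training split ...
counter = Counter(tok for example in vocab_train.text for tok in example)
# ... build one Vocab object from those counts ...
shared_vocab = vocab.Vocab(counter, specials=['<unk>', '<pad>'])

# ... and assign it to the Field used for dataset B, skipping build_vocab there
TEXT = data.Field()
TEXT.vocab = shared_vocab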

Yes, build the vocab first, then freeze it. After freezing the vocab, any new, previously unseen words should just be mapped to unk. I think?
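
If it helps: as far as I can tell from the legacy torchtext source (treat this as an assumption about your particular version), a built Vocab already behaves as if frozen. Its stoi mapping is a defaultdict whose default value is the <unk> index, so unseen tokens from dataset B fall back to <unk> at numericalization time without any explicit freeze() call. A quick check, with a made-up token:

# assuming TEXT.vocab was built on dataset A only:
unk_index = TEXT.vocab.stoi['<unk>']
oov_index = TEXT.vocab.stoi['token-that-only-appears-in-dataset-B']  # hypothetical token
assert oov_index == unk_index   # unseen words map to the <unk> index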