Ensuring vocabulary correctness in torchtext

Following a torchtext sentiment analysis tutorial, I have built two models using the same model architecture on two different input datasets.

Now I want to test their performance on the same test set (theory concerns aside, that’s part of my use-case), after I’ve saved the models. The training data and test data come from separate csv files.

I was stuck for a while on how to get the vocabularies in place for both to work as expected. Let me try to illustrate briefly:

# All these constants and the RNN class are defined in the tutorial
import torch
from torchtext import data  # in torchtext 0.9-0.11 these live under torchtext.legacy
from torchtext.data import Field, LabelField, TabularDataset

model = RNN(INPUT_DIM,
            EMBEDDING_DIM,
            HIDDEN_DIM,
            OUTPUT_DIM,
            N_LAYERS,
            BIDIRECTIONAL,
            DROPOUT,
            PAD_IDX)

# Load the saved model
model.load_state_dict(torch.load('model-1.pt'))

# Load test data

TEST_TEXT = Field(sequential=True, tokenize="basic_english", include_lengths=True)

TEST_LABEL = data.LabelField(dtype=torch.float)

test_fields = [('review', TEST_TEXT),
               ('sentiment', TEST_LABEL)]


test_reviews = TabularDataset(
    path='./test_examples.csv', format='csv',
    fields=test_fields,
    skip_header=False)

# Now I need to match the test data vocabulary to the data the model was trained on

REVIEWS_ONE_TEXT = Field(sequential=True, tokenize="basic_english", include_lengths=True)

REVIEWS_ONE_LABEL = data.LabelField(dtype=torch.float)

fields_one = [('review', REVIEWS_ONE_TEXT),
              ('sentiment', REVIEWS_ONE_LABEL)]

reviews_one = TabularDataset(
    path='./reviews-one.csv', format='csv',
    fields=fields_one,
    skip_header=True)

# Split training data into train/val, this was done at training time as well, using the same seed

import random

reviews_one_train_data, reviews_one_valid_data = reviews_one.split(
    split_ratio=[.9, .1], random_state = random.seed(SEED))

# Now I need to create the vocabs from the training split
MAX_VOCAB_SIZE = 25_000

REVIEWS_ONE_TEXT.build_vocab(reviews_one_train_data,
                             max_size=MAX_VOCAB_SIZE,
                             vectors="glove.6B.100d",
                             unk_init=torch.Tensor.normal_)

REVIEWS_ONE_LABEL.build_vocab(reviews_one_train_data)

# Then I need to ensure the test Fields share the vocab the model was trained on
TEST_TEXT.vocab = REVIEWS_ONE_TEXT.vocab

TEST_LABEL.vocab = REVIEWS_ONE_LABEL.vocab

# Only now can I call test and get correct results

# Create iterators
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
test_iterator = data.BucketIterator(
    test_reviews, batch_size=BATCH_SIZE, sort=False, device=device)
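To see why the two `vocab` assignments above are needed at all: numericalization is just a lookup of each token in the training vocabulary's string-to-index mapping, and the model's embedding rows were learned against those exact indices. The effect can be sketched without torchtext, using a plain dict as a hypothetical stand-in for a `Vocab` (`stoi_one` and `numericalize` are illustrative names, not torchtext API):

```python
# Minimal sketch of what numericalization does under the hood,
# with a plain dict standing in for a torchtext Vocab.

def numericalize(tokens, stoi, unk_index=0):
    """Map tokens to indices; tokens missing from the vocab fall back to <unk>."""
    return [stoi.get(tok, unk_index) for tok in tokens]

# Vocab built from training dataset one
stoi_one = {"<unk>": 0, "<pad>": 1, "great": 2, "plot": 3, "boring": 4}

tokens = ["great", "plot", "awful"]  # "awful" was never seen in training
print(numericalize(tokens, stoi_one))  # [2, 3, 0] -- "awful" maps to <unk>
```

If the test Field built its own vocab from the test csv instead, "great" could land on a different index and the model would read the wrong embedding row for it.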

Now this is a lot of code to write just to test a model on a csv file. And if I then want to run the same test set through my second model, which was trained on a different dataset, I have to repeat all of the above.

My question is: Is there some quicker/better way to ensure the vocab for the test data matches the one that was used at training time? Especially considering that I’ll want to train two models on separate datasets, then use the same test set on the two models.
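One direction I have been considering is to serialize the vocab once at training time instead of rebuilding it from the training csv on every test run. The principle, sketched with `pickle` and a plain dict as a hypothetical stand-in for the real `Vocab` object:

```python
import io
import pickle

# Hypothetical stand-in for a torchtext Vocab: the stoi mapping built at training time
stoi = {"<unk>": 0, "<pad>": 1, "great": 2, "terrible": 3}

# Serialize once, right after build_vocab at training time...
buf = io.BytesIO()  # a file path would be used in practice
pickle.dump(stoi, buf)

# ...then at test time just reload it, with no need to re-read the
# training csv or re-run the train/val split
buf.seek(0)
restored = pickle.load(buf)
assert restored == stoi
```

With torchtext itself the analogous move would presumably be `torch.save(REVIEWS_ONE_TEXT.vocab, 'vocab-one.pt')` after training and `TEST_TEXT.vocab = torch.load('vocab-one.pt')` before testing; `Vocab` objects in the legacy API are ordinary picklable Python objects, but that is worth verifying against the torchtext version in use.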

Note also that if the seed in the train/val split changed, or if I hadn't set one at all, this would not work: a different split would put different words into the training vocabulary, making it impossible to reconstruct the same vocab at test time.
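That failure mode can be made concrete with a toy vocab builder (a simplified, hypothetical `build_stoi`; real torchtext orders entries by frequency, but the point about split-dependent contents is the same):

```python
def build_stoi(training_tokens):
    """Toy vocab builder: specials first, then tokens in first-seen order."""
    stoi = {"<unk>": 0, "<pad>": 1}
    for tok in training_tokens:
        if tok not in stoi:
            stoi[tok] = len(stoi)
    return stoi

# Two different train/val splits of the same underlying data
split_a = ["great", "awful", "plot"]
split_b = ["great", "plot", "boring"]

print(build_stoi(split_a))  # {'<unk>': 0, '<pad>': 1, 'great': 2, 'awful': 3, 'plot': 4}
print(build_stoi(split_b))  # {'<unk>': 0, '<pad>': 1, 'great': 2, 'plot': 3, 'boring': 4}
# Same corpus, different splits: "plot" is index 4 in one vocab and 3 in the
# other, so a saved model's embedding rows only match one of them.
```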