Problem using different word embedding on pre-trained model


(John Richmond) #1

I have a model I am using for sentence classification. At the start I create a vocabulary of the words in the training and validation datasets and then load the GloVe word vectors corresponding to these into an embedding layer (which is fixed and not updated during training).

After the embedding layer I have a relatively simple three-layer net with a softmax output.

The model works fine on the training and test data and I get about 90% accuracy on both.

I then save the model using the command
torch.save(model.state_dict(), model_path)

Fine so far.

To use the model on new data which hasn’t been categorised in advance I then re-create the model and then read in the parameters as follows:

model.load_state_dict(torch.load(model_path))
model.eval()

The problem is that I now need to change the embedding layer since the vocabulary is different. The approach I have taken is to create a new vocabulary and set of pre-trained word vectors from Glove and then update the embedding layer of the model as follows:

model.embeddings = nn.Embedding(vocab_size, word_vector_length)
model.embeddings.weight.data.copy_(v.vectors)
model.eval()
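As an aside, `nn.Embedding.from_pretrained` does the construct-and-copy in one step and can freeze the weights at the same time, which matches the fixed-embedding setup during training. A minimal sketch, with placeholder sizes and random numbers standing in for the real GloVe matrix (`v.vectors` in the code above):

```python
import torch
import torch.nn as nn

# Placeholder sizes and random vectors standing in for the real
# GloVe matrix loaded for the new vocabulary.
vocab_size, word_vector_length = 5, 4
vectors = torch.randn(vocab_size, word_vector_length)

# Build the layer and copy the weights in one step; freeze=True keeps
# the embedding fixed, as it was during training.
embeddings = nn.Embedding.from_pretrained(vectors, freeze=True)

assert torch.equal(embeddings.weight, vectors)
```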

When I look at the model everything appears to be in order and the model runs, but the results are clearly wrong. If I use the model on the data I originally trained it on, I get very different (and awful) results, even when the vocab and embedding layer loaded are the same as they were during training.

I have checked, and the weights in all of the hidden layers seem to be the same before and after the loading process. Am I doing something stupid?

Many thanks

John


(Hugh Perkins) #2

I originally went off on a tangent in my reply, about ‘original research’ yada yada, but then noticed this sentence: ‘(even when the vocab and embedding layer loaded are the same as they were during training)’. So you can probably test this without even training, just using random numbers and a vocab size of, say, 5. That will be much quicker and easier to get working, and to find the bug. As a bonus, it means you could post a code snippet we could run and try 🙂
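Something along these lines, a rough sketch with a made-up two-layer stand-in for the classifier (the `Net` class and all sizes below are hypothetical, random vectors stand in for GloVe), would exercise the save/load/swap path end to end:

```python
import io
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in for the classifier: a frozen embedding plus a small head.
class Net(nn.Module):
    def __init__(self, vectors):
        super().__init__()
        self.embeddings = nn.Embedding.from_pretrained(vectors, freeze=True)
        self.fc = nn.Linear(vectors.size(1), 3)

    def forward(self, x):
        return self.fc(self.embeddings(x).mean(dim=1))

vectors = torch.randn(5, 4)          # "pre-trained" vectors, vocab size 5
model = Net(vectors)
x = torch.tensor([[0, 1, 2, 3]])
before = model(x)

# Save the state dict, rebuild the model, reload, then swap in an
# embedding layer built from the *same* vectors.
buf = io.BytesIO()
torch.save(model.state_dict(), buf)
buf.seek(0)
model2 = Net(torch.zeros(5, 4))
model2.load_state_dict(torch.load(buf))
model2.embeddings = nn.Embedding.from_pretrained(vectors, freeze=True)
model2.eval()
after = model2(x)

# With identical vectors the outputs must match exactly.
assert torch.equal(before, after)
```

If this passes but the real pipeline fails, the difference is in how the real vocabulary and vectors are rebuilt, not in PyTorch's save/load.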


(John Richmond) #3

Thanks Hugh, will try this and come back once I have a clearer picture of what is happening.

John


(John Richmond) #4

After trying various simple models it turns out that PyTorch is doing exactly what it should and is consistent. The problem arises because the GloVe word vectors for tokens it does not recognise, such as ‘’, seem to change each time the program is run, which means that the inputs to the network are changing.

I can avoid this by loading the entire vocabulary for all of the words I need in advance, then training on the part of the data for which I have category information, and finally running the remaining data through the model.
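Another option, assuming the vocabulary reserves an index for the unknown token, is to initialise that row of the embedding matrix deterministically (zeros here, but any fixed choice works) instead of leaving it to fresh random numbers, so unseen words embed identically on every run. A sketch with made-up sizes:

```python
import torch

# Hypothetical vocabulary: index 0 is the unknown token.
# Rows for in-vocabulary words come from GloVe lookups (random stand-ins
# here); the unknown-token row is set to a fixed value instead of fresh
# random numbers, so unseen words map to the same vector on every run.
glove_rows = torch.randn(4, 6)               # stand-in for looked-up GloVe vectors
unk_row = torch.zeros(1, 6)                  # deterministic unknown-token vector
vectors = torch.cat([unk_row, glove_rows])   # full embedding matrix, vocab size 5
```

Setting a global seed with `torch.manual_seed` before building the vocabulary would also make the random initialisation repeatable, but pinning the unknown-token row is the more explicit fix.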

I would appreciate any advice on how others tackle this problem, since it can’t be unique to me.

Thanks

John


(John Richmond) #5

Apologies, the ‘’ above should have contained the unknown symbol, but it won’t show properly in the window.