Pretrained word embeddings + on-the-fly unknown word embeddings?

I am looking for a way to handle unknown words on the fly on the encoder side. I want to freeze the pre-trained word embeddings during training, and for words that are not in the pre-trained embeddings I would like to generate custom word embeddings based on some logic (char n-grams, morphology, etc.).

import torch.nn as nn

# load the pre-trained matrix and freeze it
embedding = nn.Embedding(num_embeddings, embedding_dim)
embedding.weight = nn.Parameter(pretrained_embedding)
embedding.weight.requires_grad = False

In the forward function, where I do embedded = self.embedding(word_inputs), is there a way to get a custom representation for unknown words in the embedded tensor? Since the embedding matrix dimension is fixed, I am not sure how to do this.
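To make it concrete, here is roughly the kind of thing I am imagining (just a sketch: EncoderEmbedding, UNK_IDX and unk_vectors are names I made up, and the unknown-word vectors would be computed outside the graph by the char n-gram / morphology logic):

import torch
import torch.nn as nn

UNK_IDX = 0  # assumed: index reserved for unknown words

class EncoderEmbedding(nn.Module):
    def __init__(self, pretrained_embedding):
        super().__init__()
        # frozen pre-trained matrix, as in the snippet above
        self.embedding = nn.Embedding.from_pretrained(pretrained_embedding, freeze=True)

    def forward(self, word_inputs, unk_vectors=None):
        # word_inputs: (batch, seq_len) indices, unknown words mapped to UNK_IDX
        embedded = self.embedding(word_inputs)
        if unk_vectors is not None:
            # unk_vectors: (batch, seq_len, embedding_dim) built on the fly;
            # overwrite only the rows that correspond to unknown words
            mask = (word_inputs == UNK_IDX).unsqueeze(-1)
            embedded = torch.where(mask, unk_vectors, embedded)
        return embedded

I don't know whether this is the idiomatic way to do it, though.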

Any help is much appreciated.


So I’ve been thinking about pretty much the same thing lately. In light of FastText embeddings, which by default produce OOV embeddings from the character n-grams of words learned during training, it makes sense to have something in nn.Embedding to handle such functionality.
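For reference, the FastText idea is roughly that a word's vector is the average (or sum) of the vectors of its character n-grams, so an OOV word can still be composed. A minimal sketch of that composition, assuming ngram_vectors is a dict from n-gram strings to tensors (both names made up here):

import torch

def char_ngrams(word, n_min=3, n_max=6):
    # FastText-style character n-grams, including boundary markers
    word = "<" + word + ">"
    return [word[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)]

def oov_vector(word, ngram_vectors, dim):
    # average the vectors of the n-grams seen during training;
    # fall back to a zero vector if none of them are known
    vecs = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not vecs:
        return torch.zeros(dim)
    return torch.stack(vecs).mean(dim=0)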

Currently what I’ve been doing (I’m not sure it’s 100% correct) is the following:

def encode(self, data):
    encoded_sentences = []

    for sentence in data:
        encoded_sentence = []

        for word in sentence:
            if word not in self.wordToIndex:
                # grow the vocabulary with the new word
                self.indexToWord.append(word)

                idx = len(self.indexToWord) - 1
                self.wordToIndex[word] = idx

                # Get the current embeddings or create them on the fly (in your case)
                self.vectors[idx] = self._embeddings[word]
            else:
                idx = self.wordToIndex[word]

            encoded_sentence.append(idx)

        encoded_sentences.append(encoded_sentence)

    return encoded_sentences

The plan is then to pass the .vectors property to the nn.Embedding weights. However, this is really slow, considering that you have to go through every word just to create the weights to hand to the nn.Embedding layer.
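Concretely, the handover would look something like this (assuming vectors is the dict mapping index -> tensor that encode() filled in):

import torch
import torch.nn as nn

# stack the collected vectors into one (vocab_size, embedding_dim) matrix
weight = torch.stack([vectors[i] for i in range(len(vectors))])
embedding = nn.Embedding.from_pretrained(weight, freeze=True)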

What I don’t get is: why not skip the whole word-to-index -> index-to-embedding step and have nn.Embedding handle words directly?
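Something like this hypothetical wrapper is what I have in mind (nothing like it exists in PyTorch as far as I know; oov_fn would be e.g. the n-gram averaging above):

import torch
import torch.nn as nn

class WordLevelEmbedding(nn.Module):
    # hypothetical module that accepts word strings directly
    def __init__(self, pretrained, word_to_index, oov_fn):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)
        self.word_to_index = dict(word_to_index)
        self.oov_fn = oov_fn  # word -> tensor, e.g. a char n-gram composer

    def forward(self, sentence):
        # sentence: a list of word strings; returns (seq_len, embedding_dim)
        rows = [self.embedding.weight[self.word_to_index[w]]
                if w in self.word_to_index else self.oov_fn(w)
                for w in sentence]
        return torch.stack(rows)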