Word embedding not consistent when using nn.Embedding()

plus1sec · April 9, 2019, 5:29am

Hi, I am new to PyTorch; here is my question.
For CoNLL03 NER task, I did the following preprocessing:

Built the vocab, idx2word, word2idx, etc.
Built a pre-trained word embedding matrix “E” as a FloatTensor of shape (vocab_size, embedding_dim), ordered by the indices in my vocab. E.g., if “Hello” has index 42, then E[42] is the embedding for “Hello”.

Here is my model:

class BiLSTM(nn.Module):
    def __init__(self, embeddings, embedding_dim):
        super(BiLSTM, self).__init__()
        self.word_embeddings = nn.Embedding(num_embeddings=embeddings.shape[0], embedding_dim=embedding_dim).cuda()
        self.word_embeddings.from_pretrained(embeddings, freeze=False)
        #the "embeddings" here is the tensor E I mentioned above
        ...
    def forward(self, inputs):
        #inputs: [batch_size,seq_len], each entry is the index of that token
        x = self.word_embeddings(inputs)  
        ...

Here I found an issue:
In my understanding, x[0][0] here should correspond to the first token of the first seq from inputs. It should be the embedding for that 1st word. However, when I printed out x[0][0], it was different from the embedding of that 1st word.

It shouldn’t be that the embedding weights were updated, because I hadn’t called loss.backward() or optimizer.step() yet. Did I do anything wrong?

vdw · April 9, 2019, 11:18am

How do you create embeddings? For example, I use GloVe files and custom code that looks like this:

embed_mat = word_vector_loader.create_embedding_matrix('glove.6B.100d.txt', vectorizer.vocabulary.word_to_index, max_idx)

When I then look at the embedding for a random index value:

print(vectorizer.vocabulary.index_to_word[105])  # e.g., for me, "always"
print(embed_mat[105])

I get the same vector as in the file glove.6B.100d.txt. However, the word “always” is obviously not on line 105 of that file but on line 691. Maybe something like this causes a discrepancy in your code. Can you check whether your embeddings are correct, i.e., that embeddings[x] is the same as the vector in the file for the word idx2word[x] (like x=105 in my example above)?

I also set the word embeddings in the model a bit differently, but that shouldn’t make any difference

model.word_embeddings.weight.data.copy_(torch.from_numpy(embed_mat))
model.word_embeddings.weight.requires_grad=False

But yeah, when I have a sentence like “always …”, the sequence is [105, …] and X[0][0] after the embedding layer is the vector for “always” in the GloVe file, which is also embed_mat[105]. Just for testing, can you set freeze=True? Or do you enforce any normalization in the embedding layer?
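For reference, the check described above can be sketched roughly as follows. This is only a minimal illustration, assuming a small random matrix in place of the real GloVe matrix; `vocab_size`, `dim`, `embed_mat`, and `inputs` here are made-up stand-ins, not the values from my setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the pretrained matrix (one row per vocab index).
vocab_size, dim = 200, 5
embed_mat = torch.randn(vocab_size, dim)

# Create the layer, then overwrite its randomly initialised weights in place.
word_embeddings = nn.Embedding(num_embeddings=vocab_size, embedding_dim=dim)
word_embeddings.weight.data.copy_(embed_mat)
word_embeddings.weight.requires_grad = False

# A batch with one sequence whose first token has index 105.
inputs = torch.tensor([[105, 3, 7]])
x = word_embeddings(inputs)  # shape: [batch_size, seq_len, dim]

# The looked-up vector should be exactly row 105 of the matrix.
matches = torch.equal(x[0][0], embed_mat[105])
print(matches)
```

If the comparison comes back False in your real code, the layer's weights and your matrix have diverged somewhere before the forward pass.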

plus1sec · April 9, 2019, 6:10pm

Hi, thank you for the reply.

I generated my embedding matrix by doing this:

def build_embedding_tensor(idx2word, vsm):
    matrix = np.zeros([len(idx2word), 200], dtype=float)
    for i in idx2word:
        word = idx2word[i]
        if word != "<PAD>":
            matrix[i] = vsm.emb(word)
    tensor = torch.cuda.FloatTensor(matrix)
    return tensor

The vsm here is the vector model that maps an input word to its embedding. I used 200-dimensional FastText embeddings. In the LSTM model, I did something like:

embed_mat=build_embedding_tensor(idx2word,vsm)
self.word_embeddings = nn.Embedding(num_embeddings=embed_mat.shape[0], embedding_dim=embed_mat.shape[1]).cuda()
self.word_embeddings.from_pretrained(embed_mat, freeze=False)

print(word2idx['always'])  # 105
print(embed_mat[105])  # [1,2,3,4,5]
print(x)  # [[105,...],[...],[...]]
embed_x = self.word_embeddings(x)

Now if I print(embed_x[0][0]), I should get the same vector as embed_mat[105], say [1,2,3,4,5], but I get a completely different vector. I tried freeze=True but still got inconsistent embeddings.

vdw · April 10, 2019, 8:49am

Hm, looks rather alright to me, and your last three print statements are as expected. Maybe just some ideas:

* self.word_embeddings = nn.Embedding(...).cuda() should not be needed, since from_pretrained creates an Embedding layer rather than just setting the weights, at least according to the docs. Of course, it really shouldn’t make a difference, but if all else fails, comment it out as a test :). It’s a bit strange though; self.word_embeddings shouldn’t be moved to the GPU after from_pretrained().
* You could give my alternative a try, just to see if there’s any difference:
```
self.word_embeddings.weight.data.copy_(torch.from_numpy(embed_mat))
self.word_embeddings.weight.requires_grad = False  # or True
```
* What does print(model.word_embeddings.weight[105]) show after calling from_pretrained()? Does it match embed_mat[105]? Again, it obviously should, but there’s no harm in trying.
* Maybe make your example minimal with batch_size=1 and seq_len=1, i.e., just [[105]]. Maybe some shapes somewhere are not as expected.
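
One more thing that might be worth ruling out, related to the first point: as far as I can tell from the docs, from_pretrained is a classmethod that builds and returns a new layer, so calling it on an existing instance and discarding the result would leave that instance's random weights in place. A minimal sketch with a made-up 4x3 matrix (`E`, `layer`, and `loaded` are illustrative names, not from your code):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical "pretrained" matrix: row i is [3i, 3i+1, 3i+2].
E = torch.arange(12, dtype=torch.float32).reshape(4, 3)

# An ordinary layer starts with random weights.
layer = nn.Embedding(num_embeddings=4, embedding_dim=3)
before = layer.weight.detach().clone()

# Called on the instance, from_pretrained still returns a NEW layer;
# the return value is discarded here, so `layer` itself is not modified.
layer.from_pretrained(E, freeze=False)
unchanged = torch.equal(layer.weight.detach(), before)

# Keeping the returned layer gives the expected lookup.
loaded = nn.Embedding.from_pretrained(E, freeze=False)
idx = torch.tensor([[2]])  # batch_size=1, seq_len=1
row_matches = torch.equal(loaded(idx)[0][0], E[2])

print(unchanged, row_matches)
```

If that is what is happening, assigning the result in `__init__`, e.g. self.word_embeddings = nn.Embedding.from_pretrained(embeddings, freeze=False), and moving the model to the GPU afterwards should make the lookup match your matrix.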

Sorry for not being more helpful!

plus1sec · April 13, 2019, 4:06am

Hi, thank you so much for the ideas. I will test them to see if anything went wrong. I truly appreciate your help!