Hi, I am new to PyTorch; here is my question.
For the CoNLL03 NER task, I did the following preprocessing:
- build the vocab, idx2word, word2idx, etc.
- build a pre-trained word-embedding matrix “E”, a FloatTensor of shape (vocab_size, embedding_dim), ordered by the indices in my vocab. E.g., if “Hello” has index 42, then E[42] is the embedding for “Hello”.
Here is my model:
def __init__(self, embeddings, embedding_dim):
    super().__init__()
    # the "embeddings" here is the tensor E I mentioned above
    self.word_embeddings = nn.Embedding(num_embeddings=embeddings.shape[0],
                                        embedding_dim=embedding_dim).cuda()

def forward(self, inputs):
    # inputs: [batch_size, seq_len]; each entry is the index of that token
    x = self.word_embeddings(inputs)
Here I found an issue:
In my understanding, x[0][0] here should correspond to the first token of the first sequence from inputs, i.e., it should be the embedding of that first word. However, when I printed it out, it was different from the embedding of that first word in E.
It shouldn’t be that the embedding weights were updated, because I hadn’t called loss.backward() and optimizer.step() yet. Did I do anything wrong?
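To make my expectation concrete: as far as I understand, an embedding layer is just a row lookup into its weight matrix. Here is a plain-numpy sketch of what I expect to happen (the matrix values and indices are made up):

```python
import numpy as np

# Toy "pre-trained" embedding matrix E: vocab_size=50, embedding_dim=4.
# Row i is the embedding of the word with index i (made-up values).
vocab_size, embedding_dim = 50, 4
E = np.arange(vocab_size * embedding_dim, dtype=np.float32).reshape(vocab_size, embedding_dim)

# inputs: [batch_size, seq_len] of token indices, e.g. one sequence starting with index 42
inputs = np.array([[42, 7, 3]])

# An embedding lookup is just integer-array row indexing
x = E[inputs]  # shape: [batch_size, seq_len, embedding_dim]

print(np.array_equal(x[0, 0], E[42]))  # expect: True
```

So I expect x[0][0] to be exactly row 42 of E, which is not what I see in my model.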
How do you create embeddings? For example, I use GloVe files, and my custom code looks like this:
embed_mat = word_vector_loader.create_embedding_matrix('glove.6B.100d.txt', vectorizer.vocabulary.word_to_index, max_idx)
When I then look at an embedding for a random index value:

print(vectorizer.vocabulary.index_to_word[105]) ==> e.g., for me, "always"

I get the same vector as in the file glove.6B.100d.txt. However, the word “always” is obviously not in line 105 but in line 691. Maybe this causes some discrepancy in your code. Can you check if your embeddings tensor is correct, e.g., that embeddings[x] is the same as in the file for the word idx2word[x] (like x=105 in my example above)?
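To sketch the check I mean, here is a self-contained version with a toy two-line GloVe-format “file” standing in for glove.6B.100d.txt (words, indices, and numbers are all made up):

```python
import io
import numpy as np

# A toy GloVe-format file: one word per line, followed by its vector (made-up values)
glove_txt = io.StringIO(
    "always 0.1 0.2 0.3\n"
    "never 0.4 0.5 0.6\n"
)

# Toy vocab: note the vocab index need not match the line number in the file
word_to_index = {"always": 0, "never": 1}
embeddings = np.zeros((len(word_to_index), 3), dtype=np.float32)
file_vectors = {}
for line in glove_txt:
    parts = line.split()
    word, vec = parts[0], np.array(parts[1:], dtype=np.float32)
    file_vectors[word] = vec
    embeddings[word_to_index[word]] = vec

# The check: embeddings[x] should equal the file vector for idx2word[x]
idx2word = {i: w for w, i in word_to_index.items()}
x = 0
print(np.allclose(embeddings[x], file_vectors[idx2word[x]]))  # expect: True
```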
I also set the word embeddings in the model a bit differently, but that shouldn’t make any difference. But yeah, when I have a sentence like “always …”, the sequence is [105, …], and x after the embedding layer is the vector for “always” in the GloVe file as well as in embed_mat. Just for testing, can you set freeze=True? Or do you enforce any normalization in the embedding layer?
Hi, thank you for the reply.
I generated my embedding matrix by doing this:
def build_embedding_tensor(idx2word, vsm):
    matrix = np.zeros([len(idx2word), 200], dtype=np.float32)
    for i in idx2word:
        word = idx2word[i]
        if word != "<PAD>":  # leave the padding row as zeros
            matrix[i] = vsm.emb(word)
    tensor = torch.cuda.FloatTensor(matrix)
    return tensor
The vsm here is the vector model that maps an input word to its embedding; I used 200-dimensional FastText embeddings. In the LSTM model, I did something like:
self.word_embeddings = nn.Embedding(num_embeddings=embed_mat.shape[0], embedding_dim=embed_mat.shape[1]).cuda()
Now if I print(embed_x) for the input [1,2,3,4,5], I should get the corresponding rows of embed_mat, but I got completely different vectors. I tried freeze=True but still got inconsistent embeddings.
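Just to double-check the row-filling logic outside of PyTorch, here is the same loop with a toy vsm (a small class with made-up 3-dim vectors standing in for the FastText model):

```python
import numpy as np

# Stand-in for the FastText model: maps a word to a (here 3-dim) vector
class ToyVSM:
    def __init__(self, vectors):
        self.vectors = vectors
    def emb(self, word):
        return self.vectors[word]

idx2word = {0: "<PAD>", 1: "hello", 2: "world"}
vsm = ToyVSM({"hello": np.array([1.0, 2.0, 3.0]),
              "world": np.array([4.0, 5.0, 6.0])})

matrix = np.zeros([len(idx2word), 3], dtype=np.float32)
for i in idx2word:
    word = idx2word[i]
    if word != "<PAD>":
        matrix[i] = vsm.emb(word)

print(np.array_equal(matrix[1], vsm.emb("hello")))  # expect: True
print(np.array_equal(matrix[0], np.zeros(3)))       # <PAD> row stays zero: True
```

This part checks out for me, so the matrix itself seems fine.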
Hm, looks rather alright to me, and your last three print statements are as expected. Maybe just some ideas:
self.word_embeddings = nn.Embedding(...).cuda() should not be needed, since from_pretrained creates an Embedding layer and doesn’t just set the weights, at least according to the docs. Of course, it really shouldn’t make a difference, but if all fails, comment it out for a test :). It’s a bit strange, though, that self.embeddings would still need to be moved to the GPU after from_pretrained.
You could give my alternative a chance, just to see if there’s any difference.
What does print(model.word_embeddings.weight) show after calling from_pretrained()? Does it match emb_mat? Again, it obviously should, but no harm in trying.
Maybe make your example minimal with seq_len=1, i.e., just []. Maybe somewhere some shapes are not as expected.
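Something like this minimal check is what I have in mind (random toy weights, on CPU, seq_len=1; note freeze defaults to True in from_pretrained):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_mat = torch.randn(10, 4)  # toy pre-trained matrix: vocab_size=10, dim=4

# from_pretrained returns a ready nn.Embedding holding exactly these weights
word_embeddings = nn.Embedding.from_pretrained(emb_mat)  # freeze=True by default

inputs = torch.tensor([[5]])   # batch_size=1, seq_len=1
x = word_embeddings(inputs)    # shape: [1, 1, 4]

print(torch.equal(word_embeddings.weight, emb_mat))  # expect: True
print(torch.equal(x[0, 0], emb_mat[5]))              # expect: True
```

If the second print is False for you with a setup this small, something before the embedding layer (indices, shapes, weight loading) must be off.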
Sorry for not being more helpful!
Hi, thank you so much for the ideas. I will test them to see if anything went wrong. Truly appreciate your help!