Hi, I am new to PyTorch; here is my question.
For the CoNLL03 NER task, I did the following preprocessing:
- build the vocab, idx2word, word2idx, etc.
- build a pre-trained word-embedding matrix “E”, a FloatTensor of shape (vocab_size, embedding_dim), ordered by the indices in my vocab. E.g., if “Hello” has index 42, then E[42] is the embedding for “Hello”.
Here is my model:
def __init__(self, embeddings, embedding_dim):
    super().__init__()
    # the "embeddings" here is the tensor E I mentioned above
    self.word_embeddings = nn.Embedding(num_embeddings=embeddings.shape[0],
                                        embedding_dim=embedding_dim).cuda()

def forward(self, inputs):
    # inputs: [batch_size, seq_len]; each entry is the index of that token
    x = self.word_embeddings(inputs)
Here I found an issue:
In my understanding, x[0][0] here should correspond to the first token of the first sequence from inputs, i.e., it should be the embedding of that first word. However, when I printed it out, it was different from the embedding of that first word in E.
It shouldn’t be that the embedding weights were updated, because I hadn’t called loss.backward() and optimizer.step() yet. Did I do anything wrong?
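To make my expectation concrete: as far as I understand, an embedding layer is just a row lookup into its weight matrix. Here is a plain-numpy sketch of what I expect to happen (the matrix values and indices are made up):

```python
import numpy as np

# Toy "pre-trained" embedding matrix E: vocab_size=50, embedding_dim=4.
# Row i is the embedding of the word with index i (made-up values).
vocab_size, embedding_dim = 50, 4
E = np.arange(vocab_size * embedding_dim, dtype=np.float32).reshape(vocab_size, embedding_dim)

# inputs: [batch_size, seq_len] of token indices, e.g. one sequence starting with index 42
inputs = np.array([[42, 7, 3]])

# An embedding lookup is just integer-array row indexing
x = E[inputs]  # shape: [batch_size, seq_len, embedding_dim]

print(np.array_equal(x[0, 0], E[42]))  # expect: True
```

So I expect x[0][0] to be exactly row 42 of E, which is not what I see in my model.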
How do you create embeddings? For example, I use GloVe files, and my custom code looks like this:
embed_mat = word_vector_loader.create_embedding_matrix('glove.6B.100d.txt', vectorizer.vocabulary.word_to_index, max_idx)
When I then look at an embedding for a random index value:

print(vectorizer.vocabulary.index_to_word[105]) ==> e.g., for me, "always"

I get the same vector as in the file glove.6B.100d.txt. However, the word “always” is obviously not in line 105 but in line 691. Maybe this causes some discrepancy in your code. Can you check if your embeddings tensor is correct, e.g., that embeddings[x] is the same as in the file for the word idx2word[x] (like x=105 in my example above)?
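To sketch the check I mean, here is a self-contained version with a toy two-line GloVe-format “file” standing in for glove.6B.100d.txt (words, indices, and numbers are all made up):

```python
import io
import numpy as np

# A toy GloVe-format file: one word per line, followed by its vector (made-up values)
glove_txt = io.StringIO(
    "always 0.1 0.2 0.3\n"
    "never 0.4 0.5 0.6\n"
)

# Toy vocab: note the vocab index need not match the line number in the file
word_to_index = {"always": 0, "never": 1}
embeddings = np.zeros((len(word_to_index), 3), dtype=np.float32)
file_vectors = {}
for line in glove_txt:
    parts = line.split()
    word, vec = parts[0], np.array(parts[1:], dtype=np.float32)
    file_vectors[word] = vec
    embeddings[word_to_index[word]] = vec

# The check: embeddings[x] should equal the file vector for idx2word[x]
idx2word = {i: w for w, i in word_to_index.items()}
x = 0
print(np.allclose(embeddings[x], file_vectors[idx2word[x]]))  # expect: True
```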
I also set the word embeddings in the model a bit differently, but that shouldn’t make any difference. But yeah, when I have a sentence like “always …”, the sequence is [105, …], and x after the embedding layer is the vector for “always” in the GloVe file as well as in embed_mat. Just for testing, can you set freeze=True? Or do you enforce any normalization in the embedding layer?
Hi, thank you for the reply.
I generated my embedding matrix by doing this:
def build_embedding_tensor(idx2word, vsm):
    matrix = np.zeros([len(idx2word), 200], dtype=np.float32)
    for i in idx2word:
        word = idx2word[i]
        if word != "<PAD>":  # leave the padding row as zeros
            matrix[i] = vsm.emb(word)
    tensor = torch.cuda.FloatTensor(matrix)
    return tensor
The vsm here is the vector model that maps an input word to its embedding; I used 200-dimensional FastText embeddings. In the LSTM model, I did something like:
self.word_embeddings = nn.Embedding(num_embeddings=embed_mat.shape[0], embedding_dim=embed_mat.shape[1]).cuda()
Now if I print(embed_x) for the input [1,2,3,4,5], I should get the corresponding rows of embed_mat, but I got completely different vectors. I tried freeze=True but still got inconsistent embeddings.
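Just to double-check the row-filling logic outside of PyTorch, here is the same loop with a toy vsm (a small class with made-up 3-dim vectors standing in for the FastText model):

```python
import numpy as np

# Stand-in for the FastText model: maps a word to a (here 3-dim) vector
class ToyVSM:
    def __init__(self, vectors):
        self.vectors = vectors
    def emb(self, word):
        return self.vectors[word]

idx2word = {0: "<PAD>", 1: "hello", 2: "world"}
vsm = ToyVSM({"hello": np.array([1.0, 2.0, 3.0]),
              "world": np.array([4.0, 5.0, 6.0])})

matrix = np.zeros([len(idx2word), 3], dtype=np.float32)
for i in idx2word:
    word = idx2word[i]
    if word != "<PAD>":
        matrix[i] = vsm.emb(word)

print(np.array_equal(matrix[1], vsm.emb("hello")))  # expect: True
print(np.array_equal(matrix[0], np.zeros(3)))       # <PAD> row stays zero: True
```

This part checks out for me, so the matrix itself seems fine.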
Hm, looks rather alright to me, and your last three print statements are as expected. Maybe just some ideas:
self.word_embeddings = nn.Embedding(...).cuda() should not be needed, since from_pretrained creates an Embedding layer and doesn’t just set the weights, at least according to the docs. Of course, it really shouldn’t make a difference, but if all fails, comment it out for a test :). It’s a bit strange, though, that self.embeddings would still need to be moved to the GPU after from_pretrained.
You could give my alternative a chance, just to see if there’s any difference.
What does print(model.word_embeddings.weight) show after calling from_pretrained()? Does it match emb_mat? Again, it obviously should, but no harm in trying.
Maybe make your example minimal with seq_len=1, i.e., just []. Maybe somewhere some shapes are not as expected.
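Something like this minimal check is what I have in mind (random toy weights, on CPU, seq_len=1; note freeze defaults to True in from_pretrained):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_mat = torch.randn(10, 4)  # toy pre-trained matrix: vocab_size=10, dim=4

# from_pretrained returns a ready nn.Embedding holding exactly these weights
word_embeddings = nn.Embedding.from_pretrained(emb_mat)  # freeze=True by default

inputs = torch.tensor([[5]])   # batch_size=1, seq_len=1
x = word_embeddings(inputs)    # shape: [1, 1, 4]

print(torch.equal(word_embeddings.weight, emb_mat))  # expect: True
print(torch.equal(x[0, 0], emb_mat[5]))              # expect: True
```

If the second print is False for you with a setup this small, something before the embedding layer (indices, shapes, weight loading) must be off.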
Sorry for not being more helpful!
Hi, thank you so much for the ideas. I will test them to see if anything went wrong. Truly appreciate your help!