Structure of weight matrix in torch.nn.Embedding layer

I have a text dataset in which every sentence has a score, and I want to do a sentence regression task. I have word embedding vectors for each word in the sentences. Now I want to use PyTorch to define an embedding layer. I know that I should use these lines of code:

import torch
import torch.nn as nn

embed = nn.Embedding(num_embeds, embed_dim)
# pretrained_weight is a numpy matrix of shape (num_embeds, embed_dim)
embed.weight.data.copy_(torch.from_numpy(pretrained_weight))

But I don't know what the order of rows in pretrained_weight should be. Is it the same order in which the words appear in the sentences? Does it contain duplicate words?

Hi

First, as a side note, nn.Embedding.from_pretrained is convenient if you have pretrained weights.
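For example, a minimal sketch (the tensor values here are made up just for illustration):

import torch
import torch.nn as nn

# pretrained_weight: any float tensor of shape (num_embeds, embed_dim)
pretrained_weight = torch.randn(5, 3)  # dummy vectors, stand-ins for real pretrained ones
embed = nn.Embedding.from_pretrained(pretrained_weight)  # freeze=True by default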

The order of rows in pretrained_weight depends on your vocabulary-to-ID mapping: row i must hold the vector for the word whose ID is i.
If you use the torchtext library, you can see that order in the vocab object, and there are no duplicates.
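Roughly, the idea looks like this (a plain-Python sketch with a made-up three-word vocabulary, not torchtext-specific):

import torch
import torch.nn as nn

# hypothetical word-to-ID mapping; each word appears exactly once
stoi = {"the": 0, "cat": 1, "sat": 2}

# row i of the weight matrix is the vector for the word with ID i
weights = torch.randn(len(stoi), 4)  # dummy vectors for illustration
embed = nn.Embedding.from_pretrained(weights)

# looking up "cat" returns row 1 of the weight matrix
cat_vec = embed(torch.tensor([stoi["cat"]]))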

Hi. I use a method for constructing the embedding matrix based on the vocabulary word indexes:

import numpy as np

def load_embeddings(CorpusPretrainedEmbdDict, word2idx, embedding_dim):
    embeddings = np.zeros((len(word2idx), embedding_dim))
    for word, index in word2idx.items():
        vector = np.array(CorpusPretrainedEmbdDict[word], dtype='float32')
        embeddings[index] = vector
    # return after the loop, not inside it, so all rows get filled
    return embeddings

I also construct word2idx to get the index of every word in my vocabulary. Then I use load_embeddings to construct the embedding matrix for the words in my vocabulary based on word2idx.

word2idx = {word: idx for idx, word in enumerate(DUC_vocab)}
#create embedding matrix for our vocab (based on word2idx)
my_embeddings=load_embeddings(CorpusPretrainedEmbdDict,word2idx,n_dim)
#Create an embedding layer
embed=nn.Embedding(len(word2idx),n_dim)
#feed our pretrained word vectors to embedding layer
embed.weight.data.copy_(torch.from_numpy(my_embeddings))
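As a sanity check I then look up a single word like this ("example_word" is just a placeholder for a word in my vocabulary):

idx = torch.tensor([word2idx["example_word"]])  # placeholder word
vec = embed(idx)  # should match my_embeddings[word2idx["example_word"]]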

Could you please tell me whether this approach is correct? Thanks.

Great, and I think this will work as you expect.


Sorry, I have another question. I want to do a text classification task, and I have prepared the embedding layer as discussed above. But the sentences in the corpus have different lengths, so I think I need zero padding. I'm confused: if I do zero padding, will the input to the embedding layer that we discussed in the previous posts become wrong?
Thanks.
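What I have in mind is roughly this, where I would reserve index 0 for padding (shifting the IDs by one and the names sentences / DUC_vocab are my own assumptions here):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

# shift word IDs by one so that 0 is free for padding (my own convention)
word2idx = {word: idx + 1 for idx, word in enumerate(DUC_vocab)}

embed = nn.Embedding(len(word2idx) + 1, n_dim, padding_idx=0)  # row 0 stays zero

# sentences as lists of IDs, padded to the same length with 0
batch = [torch.tensor([word2idx[w] for w in sent]) for sent in sentences]
padded = pad_sequence(batch, batch_first=True, padding_value=0)
embedded = embed(padded)  # shape: (batch, max_len, n_dim)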

When I look at the actual weight matrix in the Embedding object, it has one more entry than the length of my vocabulary. Is the first one (or last one) for unknown words?