Problem in understanding embedding layer with pretrained word embedding vectors

Hi, I have some problems understanding the embedding layer in PyTorch. I know that the embedding layer is a lookup table with dimensions vocab_size x embedding_dim, and that we can retrieve embedding vectors from it by their indices. Suppose I want to use pretrained word embedding vectors obtained from a GloVe model.
This is part of my code:

def __init__(self, …, my_embeddings, …):
    # my_embeddings: vocab_size x embedding_dim numpy array of GloVe vectors
    # hidden_size must equal embedding_dim for the copy_ below to work
    self.embedding = nn.Embedding(vocab_size, hidden_size)
    self.embedding.weight.data.copy_(torch.from_numpy(my_embeddings))

def forward(self, input, hidden):
    embedded = self.embedding(input)

my_embeddings is a vocab_size x embedding_dim matrix.
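
For concreteness, here is a toy sketch of the lookup-table behaviour described above (the sizes are made up): passing a tensor of word indices to nn.Embedding returns the corresponding rows of its weight matrix.

import torch
import torch.nn as nn

vocab_size, embedding_dim = 5, 3
emb = nn.Embedding(vocab_size, embedding_dim)

indices = torch.tensor([0, 3, 3])   # word indices into the table
vectors = emb(indices)              # one row per index
print(vectors.shape)                # torch.Size([3, 3])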

1- Is it correct that vocab_size is the count of unique words in the train dataset?

For the words of the train dataset that are missing from the pretrained model, I have built an embedding vector that is the average of all the word embedding vectors in the train dataset (see the sketch below).
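
A minimal sketch of that averaging step, assuming a hypothetical dict glove that maps each word to a NumPy vector and a list train_vocab of the unique train words (word i gets row i of my_embeddings):

import numpy as np

# Average of all train-set words that do have a pretrained vector
known_vectors = [glove[w] for w in train_vocab if w in glove]
avg_vector = np.mean(known_vectors, axis=0)

# Rows line up with the indices fed to nn.Embedding;
# words missing from GloVe get the averaged vector.
my_embeddings = np.stack([glove[w] if w in glove else avg_vector
                          for w in train_vocab])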

I don't have a problem with the train dataset, but I do have a problem with the test dataset.

2- Do I have to use the same my_embeddings as in the train phase (the my_embeddings constructed from the train dataset)?

3- What should I do with words from the test dataset that are not included in my_embeddings?

Thanks in advance.

Yes, vocab_size is the number of unique words, unless you limit the vocabulary size to some value.
I think you have to use the same embeddings that were used in the training phase.

I am not sure about 3, i.e., whether the average-embedding logic you used for unknown tokens in the train set should also be applied to the test set.

Instead, you can set unk_init = torch.Tensor.normal_ (for example when loading vectors with torchtext), so that unknown words are initialized from a Gaussian distribution; this can be applied to both the train and test sets.
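
For reference, a minimal sketch of that idea using torchtext's pretrained-vector loader (the GloVe name and dim below are just examples, and this assumes the classic torchtext.vocab API):

import torch
from torchtext.vocab import GloVe

# unk_init is called for every token that has no pretrained vector;
# torch.Tensor.normal_ fills that tensor in place from a standard Gaussian.
vectors = GloVe(name="6B", dim=100, unk_init=torch.Tensor.normal_)

unk_vec = vectors["awordnotinglove123"]  # randomly initialized instead of all zeros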
