Hi, I have some problems understanding the embedding layer in PyTorch. I know that the embedding layer is a lookup table with dimensions vocab_size x embedding_dim, and that we can retrieve embedding vectors from it by their indices. Suppose I want to use pretrained word embedding vectors obtained from the GloVe model.
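Just to make sure I have the basics right, this is a minimal sketch of what I mean by a lookup (the sizes and indices are made up):

    import torch
    import torch.nn as nn

    # toy lookup table: 10 words, 4-dimensional vectors
    embedding = nn.Embedding(10, 4)
    indices = torch.tensor([1, 5, 7])   # word indices, dtype int64
    vectors = embedding(indices)        # shape (3, 4): one row per index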
Here is part of my code:
    def __init__(self, …, my_embeddings, …):
        # hidden_size must equal embedding_dim so that the copy below works
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.embedding.weight.data.copy_(torch.from_numpy(my_embeddings))

    def forward(self, input, hidden):
        embedded = self.embedding(input)
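For what it's worth, I believe the same loading could also be done in one step with nn.Embedding.from_pretrained (a sketch, assuming my_embeddings is a NumPy array):

    # builds the layer and copies the weights in one call;
    # freeze=False keeps the vectors trainable (freeze=True is the default)
    self.embedding = nn.Embedding.from_pretrained(
        torch.from_numpy(my_embeddings).float(), freeze=False)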
my_embeddings is a vocab_size x embedding_dim matrix.
1- Is it correct that vocab_size is the count of unique words in the train dataset?
For the words of the train dataset that are missing from the pretrained model, I have built an embedding vector that is the average of all the word embedding vectors in the train dataset (see the sketch below). I don't have a problem with the train dataset, but I do have a problem with the test dataset.
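This is roughly how I compute that average vector (a sketch; glove_vectors, train_vocab, and word_to_index are my own names):

    import numpy as np

    # rows for train words that do appear in the pretrained GloVe model
    known = [glove_vectors[w] for w in train_vocab if w in glove_vectors]
    avg_vector = np.mean(known, axis=0)  # shape: (embedding_dim,)

    # every train word missing from GloVe gets the average vector
    for w in train_vocab:
        if w not in glove_vectors:
            my_embeddings[word_to_index[w]] = avg_vector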
2- Do I have to use the same my_embeddings as in the train phase (the my_embeddings that was constructed from the train dataset)?
3- What should I do with the words from the test dataset that are not included in my_embeddings?
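For context, my current word-to-index mapping simply breaks on such words. One option I was considering is reserving an <unk> index they all fall back to (a sketch; word_to_index and UNK_IDX are hypothetical names):

    UNK_IDX = 0  # reserved index whose row in my_embeddings would hold an <unk> vector

    def encode(words):
        # unknown test words fall back to the reserved <unk> index
        return [word_to_index.get(w, UNK_IDX) for w in words]

Is that a reasonable approach, or is there a better way?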
Thanks in advance.