When I have a vocab size of 40000 and want to embed it into 300 dimensions,
I use nn.Embedding(40000, 300).
Then how are the embeddings trained? Since it is not a word2vec-style task, there is no label for each word.
Embedding is not a training method, it's a lookup table. You first map each word in the vocabulary to a unique integer index, and then nn.Embedding just maps this index to a vector of size 300.
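For illustration, here is a minimal sketch of that lookup (the word-to-index mapping and the indices below are made-up examples):

import torch
import torch.nn as nn

# hypothetical example: vocab of 40000 words, 300-dim vectors
embedding = nn.Embedding(40000, 300)

# suppose your preprocessing mapped "hello" -> 12 and "world" -> 431
indices = torch.tensor([12, 431])

vectors = embedding(indices)   # looks up rows 12 and 431 of the weight matrix
print(vectors.shape)           # torch.Size([2, 300])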
If the 300-dimensional vector is not trained, that word vector does not represent relations among words, right?
I found this post very helpful. I think nn.Embedding just initializes the lookup table, and thereafter you train it with gradient descent.
nn.Embedding acts like a trainable lookup table. The relations between words will be learned during its training.
This blog post might be useful to get some intuition on this layer.
The 40,000 word vectors are learned as just another parameter of the network that you train. There is a lot of literature on pretraining word embeddings using LSA or the W2V algorithms, but by initializing random vectors (here of dimension 300) and applying updates to them with backpropagation, we can learn good approximations of such vectors that are tuned to the objective of your NN.
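As a rough sketch of what that looks like in practice (the toy classifier, fake batch, and labels below are made up purely to show that the embedding weights receive gradients and get updated like any other parameter):

import torch
import torch.nn as nn

class ToyClassifier(nn.Module):
    def __init__(self, vocab_size=40000, emb_dim=300, num_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # randomly initialized
        self.fc = nn.Linear(emb_dim, num_classes)

    def forward(self, token_ids):
        x = self.emb(token_ids)        # (batch, seq_len, emb_dim)
        x = x.mean(dim=1)              # simple mean pooling over the sequence
        return self.fc(x)

model = ToyClassifier()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

# fake batch: 8 sequences of 20 token ids with random labels
tokens = torch.randint(0, 40000, (8, 20))
labels = torch.randint(0, 2, (8,))

before = model.emb.weight.detach().clone()
loss = criterion(model(tokens), labels)
loss.backward()
optimizer.step()

# the embedding rows used in this batch have changed
print((model.emb.weight != before).any())   # tensor(True)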
Thanks @ptrblck. But I have another question: how does PyTorch learn these embeddings? Does it use a word2vec SkipGram or CBOW kind of model? Can you kindly provide more details?
thanks
An embedding layer is a simple lookup table accepting a sparse input (word index) which will be mapped to a dense representation (feature tensor). The embedding weight matrix will get gradients and will thus be updated. SkipGram etc. would refer to a training technique and your model might use embedding layers for it.
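To see that only the rows actually looked up in the forward pass receive gradients, here is a small check (the layer size and indices are arbitrary):

import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)
out = emb(torch.tensor([2, 5]))   # look up rows 2 and 5
out.sum().backward()

# only rows 2 and 5 of the gradient are nonzero
print(emb.weight.grad)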
Then you mean the default training technique of nn.Embedding is SkipGram?
No, I don't think SkipGram is the default training technique, and my previous post tried to separate the actual layer (nn.Embedding) from any learning technique (e.g. SkipGram).
I think the upshot is that nn.Embedding can be viewed as a map from token_id → vector. How you get this vector is up to you. What the token is, is also up to you.
All of the above embeddings are such that a unique token has a unique (universal) vector for the model. There is no notion of "context". Transformers get token representations with context. So, in that case you feed in "the river bank" and "I robbed a bank" and you get different vector representations for the token "bank" (assuming words are the tokens here). For SkipGram, "bank" has just one embedding, which you learn via optimization.
Basically, Embedding is a building block: your model and configuration decide what it contains and how you train it.
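A small sketch of the "no context" point above: with a plain nn.Embedding, the same token id always yields the same vector no matter what surrounds it (the ids below are invented):

import torch
import torch.nn as nn

emb = nn.Embedding(100, 8)

# pretend id 7 is "bank" in two different sentences
sentence_a = torch.tensor([3, 9, 7])     # "the river bank"
sentence_b = torch.tensor([1, 4, 2, 7])  # "I robbed a bank"

vec_a = emb(sentence_a)[-1]
vec_b = emb(sentence_b)[-1]
print(torch.equal(vec_a, vec_b))         # True: identical, context plays no role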
SkipGram is not a training technique. SkipGram and CBOW are rather network architectures designed to learn meaningful word vectors. For example, SkipGram is set up to predict the surrounding words given a target word (CBOW is the other way around). After the SkipGram model is trained on a large corpus, one can extract "some weights" from the model that represent word vectors.
These word vectors can be downloaded, and you can use them as initial weights for your nn.Embedding layer. Still, they are just some weights, so if you don't freeze the embedding layer, they get updated during backpropagation like any other part of your network.
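For example, assuming you already have a pretrained weight matrix (here just a random placeholder standing in for real word2vec/GloVe vectors), you could load it with nn.Embedding.from_pretrained and choose whether to freeze it:

import torch
import torch.nn as nn

# placeholder for pretrained vectors, e.g. loaded from a word2vec/GloVe file
pretrained = torch.randn(40000, 300)

# freeze=True keeps the vectors fixed; freeze=False lets them be fine-tuned
emb_frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)
emb_tuned = nn.Embedding.from_pretrained(pretrained, freeze=False)

print(emb_frozen.weight.requires_grad)  # False
print(emb_tuned.weight.requires_grad)   # True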
But what if my vocabulary is very large and I haven't yet obtained the indices of all my words?
Does nn.Embedding accept indices out of the range of what was given at initialization?
I cannot load my vocabulary all at once, which means I don't even know its size yet; I am batching through texts and there could always be new words. How can I extend that layer with those words dynamically?
Here is a definition you can make use of and some tests showing it works:
import torch
import torch.nn as nn

def expand_emb(emb_layer, num_new_emb: int):
    # append num_new_emb randomly initialized rows to the embedding weight matrix
    with torch.no_grad():
        new_weights = torch.randn((num_new_emb, emb_layer.embedding_dim))
        emb_layer.weight.data = torch.cat([emb_layer.weight.data, new_weights])
        emb_layer.num_embeddings += num_new_emb
    return emb_layer

model = nn.Embedding(10000, 300)
print(model.weight.data.size())

dummy_inputs = torch.randint(15000, (12, 1200))

# test the model; the indices go up to 14999, larger than num_embeddings, so we expect an error
try:
    print(model(dummy_inputs).size())
except Exception as e:
    print(e)

# add new embeddings
model = expand_emb(model, 5000)

# check the new size
print(model.weight.data.size())

# test the model again
try:
    print(model(dummy_inputs).size())
    print("Model successful")
except Exception as e:
    print(e)
On a side note, when you add new embeddings during a training run, keep in mind that your learning rate may be lower than it was when the initial embeddings were trained, so the new embeddings may not end up performing as well as the original ones.