How is nn.Embedding trained?

When I have a vocab size of 40000 and want to embed each word into 300 dimensions,

I use nn.Embedding(40000, 300)

How are the embeddings trained then? Since this is not a word2vec-style task, there is no label for each word.


Embedding is not for training, it's a lookup table. You first map each word in the vocabulary to a unique integer index, and then nn.Embedding just maps this index to a vector of size 300.
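For example, here is a minimal sketch of that lookup (the 40000/300 sizes just mirror the question, and the indices are made-up values):

import torch
import torch.nn as nn

emb = nn.Embedding(40000, 300)   # a table with 40000 rows, each a 300-dim vector

indices = torch.tensor([5, 123, 39999])        # hypothetical word indices, must be < 40000
vectors = emb(indices)                         # each index selects its row of emb.weight

print(vectors.shape)                           # torch.Size([3, 300])
print(torch.equal(vectors[0], emb.weight[5]))  # True: it is literally row 5 of the table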


If the 300-dimensional vectors are not trained, those word vectors do not represent relations among words, right?


https://stackoverflow.com/questions/44881999/word-embedding-lookuptable-word-embedding-visualizations

I found this post very helpful. I think nn.Embedding just initializes the lookup table, and thereafter you train it with gradient descent.
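As a small sketch of that idea (the loss below is just a made-up placeholder to produce gradients):

import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)                        # tiny table: 10 tokens, 4-dim vectors
opt = torch.optim.SGD(emb.parameters(), lr=0.1)

idx = torch.tensor([2, 7])
before = emb.weight[idx].detach().clone()

loss = emb(idx).pow(2).sum()                     # placeholder loss
loss.backward()
opt.step()                                       # gradient descent step on the table

print(torch.allclose(before, emb.weight[idx]))   # False: the looked-up rows have moved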


nn.Embedding acts like a trainable lookup table.
The relations between words will be learned during its training.
This blog post might be useful to get some intuition on this layer.


The 40,000 word vectors are learned as just another parameter of the network that you train. There is a lot of literature on pretraining word embeddings using LSA or the word2vec algorithms, but by initializing random vectors (here of dimension 300) and updating them with backpropagation, we can learn good approximations of such vectors that are tuned to the objective of your NN.
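For instance, in a toy setup (a hypothetical classifier, nothing specific from this thread), the embedding weights are just one more parameter tensor that the optimizer updates:

import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    # hypothetical model: embed word indices, average them, classify
    def __init__(self, vocab_size=40000, emb_dim=300, num_classes=5):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)  # randomly initialized
        self.fc = nn.Linear(emb_dim, num_classes)

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        vectors = self.emb(token_ids)      # (batch, seq_len, emb_dim)
        return self.fc(vectors.mean(dim=1))

model = TinyClassifier()
opt = torch.optim.Adam(model.parameters())                 # includes model.emb.weight

logits = model(torch.randint(0, 40000, (8, 20)))           # fake batch of 8 sequences
loss = nn.functional.cross_entropy(logits, torch.randint(0, 5, (8,)))
loss.backward()
opt.step()                                                 # also updates model.emb.weight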


Thanks @ptrblck. But I have another question: how does PyTorch learn these embeddings? Does it use a word2vec SkipGram or CBOW kind of model? Can you kindly provide more details?
Thanks


An embedding layer is a simple lookup table accepting a sparse input (word index) which will be mapped to a dense representation (feature tensor). The embedding weight matrix will get gradients and will thus be updated. SkipGram etc. would refer to a training technique and your model might use embedding layers for it.
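One way to see both points at once (a small made-up check): only the rows that were actually looked up receive gradient.

import torch
import torch.nn as nn

emb = nn.Embedding(6, 3)            # 6 tokens, 3-dim vectors
out = emb(torch.tensor([1, 4]))     # dense (2, 3) feature tensor for the two indices
out.sum().backward()

print(emb.weight.grad)              # nonzero only in rows 1 and 4
# nn.Embedding(6, 3, sparse=True) would return these gradients as a sparse tensor instead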


Then you mean the default training technique of nn.Embedding is SkipGram?


No, I don't think SkipGram is the default training technique; my previous post tried to separate the actual layer (nn.Embedding) from any learning technique (e.g. SkipGram).


I think the upshot is that nn.Embedding can be viewed as a map from token_id → vector. How you get this vector is up to you. What the token is, is also up to you.

  1. Skip-gram with negative sampling: https://arxiv.org/pdf/1310.4546.pdf
    Note that here you technically have two embedding tables (context and output). For each token, you can say its embedding is either the context embedding or the average of the context and output embeddings. For each word, predict the surrounding words (see the sketch after this list).
  2. CBOW
    Same as above, but with a mirrored architecture: for each word, the surrounding words predict the word.
  3. FastText: https://fasttext.cc/
    Here the tokens are not words but character n-grams. For a word, you get the embedding by averaging the embeddings over all contiguous n-grams in the word.
  4. GloVe (Global Vectors for Word Representation)
    More of a global objective than the above, with a different notion of similarity between words. But the bottom line is that you still have words as tokens and each word gets an embedding.
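Here is a rough sketch of point 1 (my own simplification of the negative-sampling objective, not code from this thread): two embedding tables and a dot-product loss that pulls true (center, context) pairs together and pushes randomly sampled negatives apart.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramNS(nn.Module):
    # simplified skip-gram with negative sampling
    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.in_emb = nn.Embedding(vocab_size, emb_dim)   # "context" / input table
        self.out_emb = nn.Embedding(vocab_size, emb_dim)  # output table

    def forward(self, center, context, negatives):
        # center: (batch,), context: (batch,), negatives: (batch, k)
        c = self.in_emb(center)                                               # (batch, d)
        pos = (c * self.out_emb(context)).sum(-1)                             # (batch,)
        neg = torch.bmm(self.out_emb(negatives), c.unsqueeze(-1)).squeeze(-1) # (batch, k)
        return -(F.logsigmoid(pos).mean() + F.logsigmoid(-neg).mean())

model = SkipGramNS(vocab_size=1000, emb_dim=50)
loss = model(torch.randint(0, 1000, (32,)),      # center words
             torch.randint(0, 1000, (32,)),      # observed context words
             torch.randint(0, 1000, (32, 5)))    # 5 random negatives per pair
loss.backward()   # after training, the rows of model.in_emb.weight are the word vectors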

All of the above embeddings are such that a unique token has a unique (universal) vector in the model. There is no notion of "context". Transformers, by contrast, produce token representations with context. So in that case you feed in "the river bank" and "I robbed a bank" and you get different vector representations for the token "bank" (assuming words are the tokens here). For SkipGram, "bank" has just one embedding, which you learn via optimization.

Basically, nn.Embedding is a building block - your model and configuration decide what it contains and how you train it.


SkipGram is not a training technique. SkipGram and CBOW are rather network architectures designed to learn meaningful word vectors. For example, SkipGram is set up to predict the surrounding words given a target word (CBOW is the reverse). After the SkipGram model is trained on a large corpus, one can extract "some weights" from the model that represent word vectors.

These word vectors can be downloaded and you can use them as initial weights for your nn.Embedding layer. Still, they are just some weights, so if you don't freeze the embedding layer, they get updated during backpropagation like any other part of your network.
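In code that could look like this (pretrained is just a random placeholder standing in for downloaded vectors):

import torch
import torch.nn as nn

pretrained = torch.randn(40000, 300)   # placeholder for downloaded word2vec/GloVe vectors

# freeze=True keeps the vectors fixed; freeze=False lets backprop fine-tune them
emb = nn.Embedding.from_pretrained(pretrained, freeze=False)

# roughly equivalent manual route
emb2 = nn.Embedding(40000, 300)
with torch.no_grad():
    emb2.weight.copy_(pretrained)
emb2.weight.requires_grad = False      # this is what "freezing" the layer means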


But what if my vocabulary is very large, yet I haven't obtained the indices of all my words?
Does nn.Embedding accept indices out of the range of what was given at initialization?
I cannot load my vocabulary all at once, which means I don't even know its size yet; I am batching through texts and there could always be new words. How can I fit that layer with those words dynamically?


Here is a definition you can make use of and some tests showing it works:

import torch
import torch.nn as nn

def expand_emb(emb_layer, num_new_emb: int):
    # append num_new_emb randomly initialized rows to an existing embedding layer
    with torch.no_grad():
        new_weights = torch.randn((num_new_emb, emb_layer.embedding_dim))
        emb_layer.weight.data = torch.cat([emb_layer.weight.data, new_weights])
        emb_layer.num_embeddings += num_new_emb
    return emb_layer

model = nn.Embedding(10000, 300)

print(model.weight.data.size())

dummy_inputs = torch.randint(15000, (12, 1200))  # indices in [0, 15000), larger than the table

# test the model; some indices exceed the current vocab size, so we expect an error
try:
    print(model(dummy_inputs).size())
except Exception as e:
    print(e)

# add 5000 new embeddings
model = expand_emb(model, 5000)
# check the new size
print(model.weight.data.size())

# test model again
try:
    print(model(dummy_inputs).size())
    print("Model successful")
except Exception as e:
    print(e)

On a side note: when you add new embeddings during a training regimen, take note that your learning rate may be lower than it was when the initial embeddings were trained, and thus the new embeddings may not perform as well as the original ones.
