How is nn.Embedding trained?

When I have a vocab size of 40000 and want to embed each word into 300 dimensions,

I use nn.Embedding(40000, 300)

How are the embeddings trained then? Since this is not a word2vec-style task, there is no label for each word.


Embedding is not for training, it's a lookup table. You first map each word in the vocabulary to a unique integer index, and then nn.Embedding just maps this index to a vector of size 300.
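For example, here is a minimal sketch of that lookup (the 40000/300 sizes just mirror the question, and the indices are made-up values):

import torch
import torch.nn as nn

emb = nn.Embedding(40000, 300)   # a table with 40000 rows, each a 300-dim vector

indices = torch.tensor([5, 123, 39999])        # hypothetical word indices, must be < 40000
vectors = emb(indices)                         # each index selects its row of emb.weight

print(vectors.shape)                           # torch.Size([3, 300])
print(torch.equal(vectors[0], emb.weight[5]))  # True: it is literally row 5 of the table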


If the 300-dimensional vectors are not trained, those word vectors do not represent relations among words, right?


https://stackoverflow.com/questions/44881999/word-embedding-lookuptable-word-embedding-visualizations

I found this post very helpful. I think nn.Embedding just initializes the lookup table, and thereafter you train it with gradient descent.
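As a small sketch of that idea (the loss below is just a made-up placeholder to produce gradients):

import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)                        # tiny table: 10 tokens, 4-dim vectors
opt = torch.optim.SGD(emb.parameters(), lr=0.1)

idx = torch.tensor([2, 7])
before = emb.weight[idx].detach().clone()

loss = emb(idx).pow(2).sum()                     # placeholder loss
loss.backward()
opt.step()                                       # gradient descent step on the table

print(torch.allclose(before, emb.weight[idx]))   # False: the looked-up rows have moved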


nn.Embedding acts like a trainable lookup table.
The relations between words will be learned during its training.
This blog post might be useful to get some intuition on this layer.


The 40,000 word vectors are learned as just another parameter of the network that you train. There is a lot of literature on pretraining word embeddings using LSA or the word2vec algorithms, but by initializing random vectors (here of dimension 300) and updating them with backpropagation, we can learn good approximations of such vectors that are tuned to the objective of your NN.
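For instance, in a toy setup (a hypothetical classifier, nothing specific from this thread), the embedding weights are just one more parameter tensor that the optimizer updates:

import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    # hypothetical model: embed word indices, average them, classify
    def __init__(self, vocab_size=40000, emb_dim=300, num_classes=5):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)  # randomly initialized
        self.fc = nn.Linear(emb_dim, num_classes)

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        vectors = self.emb(token_ids)      # (batch, seq_len, emb_dim)
        return self.fc(vectors.mean(dim=1))

model = TinyClassifier()
opt = torch.optim.Adam(model.parameters())                 # includes model.emb.weight

logits = model(torch.randint(0, 40000, (8, 20)))           # fake batch of 8 sequences
loss = nn.functional.cross_entropy(logits, torch.randint(0, 5, (8,)))
loss.backward()
opt.step()                                                 # also updates model.emb.weight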


Thanks @ptrblck. But I have another question: how does PyTorch learn these embeddings? Does it use a word2vec SkipGram or CBOW kind of model? Can you kindly provide more details?
Thanks


An embedding layer is a simple lookup table accepting a sparse input (word index) which will be mapped to a dense representation (feature tensor). The embedding weight matrix will get gradients and will thus be updated. SkipGram etc. would refer to a training technique and your model might use embedding layers for it.
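One way to see both points at once (a small made-up check): only the rows that were actually looked up receive gradient.

import torch
import torch.nn as nn

emb = nn.Embedding(6, 3)            # 6 tokens, 3-dim vectors
out = emb(torch.tensor([1, 4]))     # dense (2, 3) feature tensor for the two indices
out.sum().backward()

print(emb.weight.grad)              # nonzero only in rows 1 and 4
# nn.Embedding(6, 3, sparse=True) would return these gradients as a sparse tensor instead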


Then you mean the default training technique of nn.Embedding is SkipGram?


No, I don't think SkipGram is the default training technique; my previous post tried to separate the actual layer (nn.Embedding) from any learning technique (e.g. SkipGram).


I think the upshot is that nn.Embedding can be viewed as a map from token_id → vector. How you get this vector is up to you. What the token is, is also up to you.

  1. Skip-gram with negative sampling: https://arxiv.org/pdf/1310.4546.pdf
    Note that here you technically have two embedding tables (context and output). For each token, you can say its embedding is either the context embedding or the average of the context and output embeddings. For each word, predict the surrounding words (see the sketch after this list).
  2. CBOW
    Same as above, but with a mirrored architecture: for each word, the surrounding words predict the word.
  3. FastText: https://fasttext.cc/
    Here the tokens are not words but character n-grams. For a word, you get the embedding by averaging the embeddings over all contiguous n-grams in the word.
  4. GloVe (Global Vectors for Word Representation)
    More of a global objective than the above, with a different notion of similarity between words. But the bottom line is that you still have words as tokens and each word gets an embedding.
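Here is a rough sketch of point 1 (my own simplification of the negative-sampling objective, not code from this thread): two embedding tables and a dot-product loss that pulls true (center, context) pairs together and pushes randomly sampled negatives apart.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramNS(nn.Module):
    # simplified skip-gram with negative sampling
    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.in_emb = nn.Embedding(vocab_size, emb_dim)   # "context" / input table
        self.out_emb = nn.Embedding(vocab_size, emb_dim)  # output table

    def forward(self, center, context, negatives):
        # center: (batch,), context: (batch,), negatives: (batch, k)
        c = self.in_emb(center)                                               # (batch, d)
        pos = (c * self.out_emb(context)).sum(-1)                             # (batch,)
        neg = torch.bmm(self.out_emb(negatives), c.unsqueeze(-1)).squeeze(-1) # (batch, k)
        return -(F.logsigmoid(pos).mean() + F.logsigmoid(-neg).mean())

model = SkipGramNS(vocab_size=1000, emb_dim=50)
loss = model(torch.randint(0, 1000, (32,)),      # center words
             torch.randint(0, 1000, (32,)),      # observed context words
             torch.randint(0, 1000, (32, 5)))    # 5 random negatives per pair
loss.backward()   # after training, the rows of model.in_emb.weight are the word vectors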

All of the above embeddings are such that a unique token has a unique (universal) vector in the model. There is no notion of "context". Transformers, by contrast, produce token representations with context. So in that case you feed in "the river bank" and "I robbed a bank" and you get different vector representations for the token "bank" (assuming words are the tokens here). For SkipGram, "bank" has just one embedding, which you learn via optimization.

Basically, nn.Embedding is a building block - your model and configuration decide what it contains and how you train it.


SkipGram is not a training technique. SkipGram and CBOW are rather network architectures designed to learn meaningful word vectors. For example, SkipGram is set up to predict the surrounding words given a target word (CBOW is the reverse). After the SkipGram model is trained on a large corpus, one can extract "some weights" from the model that represent word vectors.

These word vectors can be downloaded and you can use them as initial weights for your nn.Embedding layer. Still, they are just some weights, so if you don't freeze the embedding layer, they get updated during backpropagation like any other part of your network.
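In code that could look like this (pretrained is just a random placeholder standing in for downloaded vectors):

import torch
import torch.nn as nn

pretrained = torch.randn(40000, 300)   # placeholder for downloaded word2vec/GloVe vectors

# freeze=True keeps the vectors fixed; freeze=False lets backprop fine-tune them
emb = nn.Embedding.from_pretrained(pretrained, freeze=False)

# roughly equivalent manual route
emb2 = nn.Embedding(40000, 300)
with torch.no_grad():
    emb2.weight.copy_(pretrained)
emb2.weight.requires_grad = False      # this is what "freezing" the layer means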


But what if my vocabulary is very large, yet I haven't obtained the indices of all my words?
Does nn.Embedding accept indices out of the range of what was given at initialization?
I cannot load my vocabulary all at once, which means I don't even know its size yet; I am batching through texts and there could always be new words. How can I fit that layer with those words dynamically?


Here is a definition you can make use of and some tests showing it works:

import torch
import torch.nn as nn

def expand_emb(emb_layer, num_new_emb: int):
    # append num_new_emb randomly initialized rows to an existing embedding layer
    with torch.no_grad():
        new_weights = torch.randn((num_new_emb, emb_layer.embedding_dim))
        emb_layer.weight.data = torch.cat([emb_layer.weight.data, new_weights])
        emb_layer.num_embeddings += num_new_emb
    return emb_layer

model = nn.Embedding(10000, 300)

print(model.weight.data.size())

dummy_inputs = torch.randint(15000, (12, 1200))  # indices in [0, 15000), larger than the table

# test the model; some indices exceed the current vocab size, so we expect an error
try:
    print(model(dummy_inputs).size())
except Exception as e:
    print(e)

# add 5000 new embeddings
model = expand_emb(model, 5000)
# check the new size
print(model.weight.data.size())

# test model again
try:
    print(model(dummy_inputs).size())
    print("Model successful")
except Exception as e:
    print(e)

On a side note: when you add new embeddings during a training regimen, take note that your learning rate may be lower than it was when the initial embeddings were trained, and thus the new embeddings may not perform as well as the original ones.
