How does nn.Embedding work?

Both nn.Linear and nn.Embedding will give you, in your example, a 3-dim vector. That’s the whole point, i.e., to convert a word into an ideally meaningful vector (i.e., a numeric, fixed-size representation of a word). The difference is w.r.t. the input:

  • nn.Linear expects a one-hot vector of the size of the vocabulary with the single 1 at the index representing the specific word

  • nn.Embedding just expects this index (and not a whole vector)

However, if both nn.Linear and nn.Embedding were initialized with the same weights (and the linear layer used no bias), their outputs would be exactly the same.
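
A minimal sketch of that equivalence (the sizes are arbitrary; the transpose is needed because the two layers store their weight matrices in opposite orientations):

import torch
import torch.nn as nn

vocab_size, emb_dim = 10, 3

embedding = nn.Embedding(vocab_size, emb_dim)        # weight shape: (10, 3)
linear = nn.Linear(vocab_size, emb_dim, bias=False)  # weight shape: (3, 10)

# Give both layers the same weights (note the transpose)
with torch.no_grad():
    linear.weight.copy_(embedding.weight.t())

idx = torch.tensor([4])  # the word's index in the vocabulary
one_hot = nn.functional.one_hot(idx, num_classes=vocab_size).float()

print(torch.allclose(embedding(idx), linear(one_hot)))  # True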

Yes, by default, the weights of both layers will be modified during the training process. In this respect, they are like any other layers in your network. However, you can tell the network not to modify the weights of any specific layer; I think it would look something like this:

embedding = nn.Embedding(10, 3)
embedding.weight.requires_grad = False  # freeze this layer's weights

This makes sense if you use pretrained word embeddings such as Word2Vec or GloVe. If you initialize your weights randomly, you certainly want them to be modified during training.
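
If you go the pretrained route, something like this sketch should work; nn.Embedding.from_pretrained does the freezing for you, and the random tensor here just stands in for real GloVe/Word2Vec vectors:

import torch
import torch.nn as nn

# Pretend these are pretrained vectors of shape (vocab_size, emb_dim)
pretrained_vectors = torch.randn(10, 3)

# freeze=True sets weight.requires_grad = False under the hood
embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)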

PyTorch has a decent tutorial on this:

https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html

Yes, the embedding layer is learnable. Ideally, the model should establish its own vector representations of the words, and this is the space where that semantic meaning gets defined.
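
A quick sanity check of that (sizes picked arbitrarily):

import torch.nn as nn

embedding = nn.Embedding(10, 3)
print(embedding.weight.requires_grad)                  # True: trained like any other layer
print(sum(p.numel() for p in embedding.parameters()))  # 30 learnable values (10 x 3)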

Hence, there are downloadable pretrained vectors. However, take some caution with these, as they were developed from scraped data and contain a lot of irrelevant representations, such as website links and pure gibberish.

Hey Li-Syuan,
I think the answer you were trying to get is “it’s random”.
I was thinking exactly the same thing as you did.
nn.Embedding simply gives you a random tensor corresponding to each input id of a word, which can then be updated by your downstream task.

nn.Embedding takes two necessary arguments, right?
The first one is basically the size of the vocab, which you set yourself; let’s say you picked 3.
The second one is the embedding (output) dimension, which you can also set arbitrarily; let’s say you picked 5.

Once you set them, then under the hood, you can think of nn.Embedding as this:
{
0: [0.123, 0.223, -.123, -.123, .322], # a completely random 5-dimensional representation of whatever token 0 corresponds to (it’s 5-dimensional because you set it to 5)
1: [0.45, .123123, .123123, .123123, .123123], # a completely random 5-dimensional representation of whatever token 1 corresponds to (it’s 5-dimensional because you set it to 5)
2: [0.656, .4564, .456456, .456456, .4564] # a completely random 5-dimensional representation of whatever token 2 corresponds to (it’s 5-dimensional because you set it to 5)
}
There are only three entries in the dict because you said the vocabulary size is 3.
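
In actual PyTorch code, that “dict” is just the layer’s randomly initialized weight matrix; a sketch with the same made-up sizes:

import torch
import torch.nn as nn

torch.manual_seed(0)  # only so the printout is repeatable
embedding = nn.Embedding(num_embeddings=3, embedding_dim=5)

# Row i holds the random 5-dim vector for token id i
# (your numbers will of course differ from the made-up ones above)
print(embedding.weight)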

Then, when you pass a text, you have some sort of tokenizer that maps the text into indices.
Say your tokenizer does this:
{
"I": 0,
"love": 1,
"cat": 2
}
Then, when you pass the text “I love cat” to your tokenizer, it becomes [0, 1, 2]
This [0, 1, 2], when passed to nn.Embedding, becomes a tensor like this:
[ [0.123, 0.223, -.123, -.123, .322],
[0.45, .123123, .123123, .123123, .123123],
[0.656, .4564, .456456, .456456, .4564]]
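
Or, as a runnable sketch of the same lookup:

import torch
import torch.nn as nn

embedding = nn.Embedding(3, 5)       # same toy setup: vocab of 3, 5-dim vectors
token_ids = torch.tensor([0, 1, 2])  # "I love cat" after the toy tokenizer above

vectors = embedding(token_ids)
print(vectors.shape)                 # torch.Size([3, 5]): one 5-dim row per token id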

If you had an arbitrary task, say “is this a grammatically correct sentence?” with 1 = yes, 0 = no,
then your model will learn something like this:
I love cat → label is 1
[ [0.123, 0.223, -.123, -.123, .322],
[0.45, .123123, .123123, .123123, .123123],
[0.656, .4564, .456456, .456456, .4564]]
→ 1

love I cat → label is 0
[[0.45, .123123, .123123, .123123, .123123],
[0.123, 0.223, -.123, -.123, .322],
[0.656, .4564, .456456, .456456, .4564]]
→ 0


and the random representations of each word would be updated accordingly so they are no longer random.
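
Here is a rough sketch of that update happening; the flatten-plus-linear classifier is just my own minimal stand-in for whatever downstream model you would really use:

import torch
import torch.nn as nn

embedding = nn.Embedding(3, 5)
classifier = nn.Linear(3 * 5, 1)     # 3 tokens x 5 dims, flattened to keep word order
optimizer = torch.optim.SGD(
    list(embedding.parameters()) + list(classifier.parameters()), lr=0.1
)

tokens = torch.tensor([0, 1, 2])     # "I love cat"
label = torch.tensor([1.0])          # 1 = grammatical

weights_before = embedding.weight.clone()

logit = classifier(embedding(tokens).reshape(-1))
loss = nn.functional.binary_cross_entropy_with_logits(logit, label)
loss.backward()
optimizer.step()

# The embedding rows have moved away from their random initialization
print(torch.equal(weights_before, embedding.weight))  # False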

What happens if you pass an index larger than 2?
Well, it’s out of vocabulary, because you only prepared 3 spots (index 0, 1, 2) in your nn.Embedding, and you’ll get an error.
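
For example:

import torch
import torch.nn as nn

embedding = nn.Embedding(3, 5)
embedding(torch.tensor([3]))  # raises an IndexError: only ids 0, 1, 2 exist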

I hope that helped.
The same question bothered me for very long too!

@husohome Thanks for your response; now I have a clear understanding of nn.Embedding.
I think we can treat nn.Embedding as a mapping function that maps a discrete variable (a word index) to a continuous one, which is learnable (differentiable), decreases memory usage, and lowers the sparsity for a large vocabulary size. Is this thought correct?

Yes. And you can check out this link.
What the embedding is trying to do is represent the large search space of the vocabulary in a smaller dimension where the meaningful data lie.
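
A rough illustration of the memory/sparsity side of that (the numbers are picked arbitrarily):

import torch

vocab_size, seq_len = 50_000, 128

# Input for nn.Linear: one one-hot float vector per token, almost entirely zeros
one_hot_input = torch.zeros(seq_len, vocab_size)
# Input for nn.Embedding: just one integer id per token
index_input = torch.randint(0, vocab_size, (seq_len,))

print(one_hot_input.numel())  # 6,400,000 mostly-zero floats
print(index_input.numel())    # 128 integers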

Perfectly summarized! An added difference is that nn.Embedding doesn’t have any bias vector.
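
You can see that directly from the parameter lists:

import torch.nn as nn

print([name for name, _ in nn.Embedding(10, 3).named_parameters()])  # ['weight']
print([name for name, _ in nn.Linear(10, 3).named_parameters()])     # ['weight', 'bias']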

Yes, the basic Word2Vec training setup does not guarantee normalized vectors. But that can be enforced during or after training, and I think most if not all pretrained word embeddings you can download are in fact normalized.
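
If you do want unit-length vectors yourself, something like this post-hoc rescaling should work (nn.Embedding also accepts a max_norm argument that renormalizes over-long vectors at lookup time):

import torch
import torch.nn as nn
import torch.nn.functional as F

embedding = nn.Embedding(10, 3)

# Rescale every row (word vector) to unit L2 norm after training
with torch.no_grad():
    embedding.weight.copy_(F.normalize(embedding.weight, p=2, dim=1))

print(embedding.weight.norm(dim=1))  # all (approximately) 1.0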