How does nn.Embedding work?

I am new to the NLP field and I have some questions about nn.Embedding. I have already seen this post, but I’m still confused about how nn.Embedding generates the vector representation. From the official website and the answer in this post, I concluded:

  1. It’s only a lookup table: given an index, it returns the corresponding vector.
  2. The vector representations are stored in a weight matrix that is initialized with random values and updated by backpropagation.

Now my questions are:

  1. If I have 1000 words and use nn.Embedding(1000, 30) to get a 30-dimensional vector for each word, will nn.Embedding generate a one-hot vector for each word and create a hidden layer of 30 neurons, like word2vec? If so, is it the CBOW or the Skip-Gram model? What’s the difference between nn.Embedding and nn.Linear?
  2. I am currently researching Visual Question Answering tasks. Any suggestions on whether or not to use pretrained vector representations, e.g. word2vec, for the question vocabulary?
11 Likes
  1. It’s neither CBOW nor Skip-Gram. Those are models trained end-to-end to predict the word given its context (CBOW) or the context given the word (Skip-Gram). Here, nn.Embedding is optimized as part of your training task. Consequently, after training, you can expect embeddings that are specific to your task rather than generic embeddings, which can be more or less useful depending on your task. To illustrate, imagine a screwdriver that works on most screws: if your screw is very specific, you will need a different tool for that specific task.
  2. Using pretrained vectors is generally a good starting point, but you should consider retraining/fine-tuning the vectors for your specific task.
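As a rough sketch of point 2: pretrained vectors can be loaded into nn.Embedding with from_pretrained and left trainable for fine-tuning. The random `pretrained` tensor below is only a stand-in for real word2vec vectors, and the sizes and indices are made up for illustration.

```python
import torch
import torch.nn as nn

# Stand-in for pretrained vectors (e.g. word2vec); random here for illustration only.
vocab_size, dim = 1000, 30
pretrained = torch.randn(vocab_size, dim)

# freeze=False keeps the vectors trainable, so they get fine-tuned with the task.
emb = nn.Embedding.from_pretrained(pretrained, freeze=False)

question = torch.tensor([[4, 17, 256, 9]])  # one question, four (hypothetical) word indices
vectors = emb(question)                     # shape: (1, 4, 30)
```

With freeze=True (the default of from_pretrained) the vectors would stay fixed during training instead.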

[EDIT]: Difference between Embedding and Linear: Linear expects vectors (e.g. one-hot representations of the words), Embedding expects tokens (e.g. word indices).


An Embedding layer is essentially just a Linear layer without a bias term. So you could define your layer as nn.Linear(1000, 30), and represent each word as a one-hot vector, e.g., [0,0,1,0,...,0] (the length of the vector is 1,000).

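As you can see, each word is a unique vector of size 1,000 with a 1 in a position that no other word uses. Now giving such a vector v with v[2]=1 (cf. example vector above) to the Linear layer simply gives you the vector stored for index 2 in that layer’s weights. (nn.Linear stores its weight transposed, so strictly speaking this is column 2 of the weight, i.e. row 2 of its transpose.)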

nn.Embedding just simplifies this. Instead of giving it a big one-hot vector, you just give it an index. This index basically is the same as the position of the single 1 in the one-hot vector.
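Here is a minimal sketch of that equivalence, assuming a bias-free nn.Linear whose weights are copied from the embedding (the copy is only there so both layers hold the same numbers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, dim = 1000, 30

emb = nn.Embedding(vocab_size, dim)
lin = nn.Linear(vocab_size, dim, bias=False)  # no bias, to match a pure lookup

# nn.Linear stores its weight as (out_features, in_features), so copy the transpose.
with torch.no_grad():
    lin.weight.copy_(emb.weight.T)

idx = torch.tensor([2])
one_hot = F.one_hot(idx, num_classes=vocab_size).float()  # shape: (1, 1000)

from_embedding = emb(idx)    # index lookup
from_linear = lin(one_hot)   # one-hot vector times weight matrix

same = torch.allclose(from_embedding, from_linear)
```

Both paths should produce the same 30-dimensional vector; the embedding just skips building the 1,000-long one-hot input.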


You explained it very clearly! One last question: does nn.Embedding do the same thing as nn.Linear? As I understand it, nn.Linear needs a vector representation as input (e.g. one-hot) so that it acts as a lookup table, while nn.Embedding does the same thing but simplifies it by using just the word index. Let me know if I misunderstood any part. Also, is there any documentation I can refer to? I have been searching for a long while!

It seems you want to implement the CBOW setup of Word2Vec. You can easily find PyTorch implementations for that. For example, I found this implementation in 10 seconds :).

This example uses nn.Embedding, so the input of the forward() method is a list of word indices (the implementation doesn’t seem to use batches). But yes, instead of nn.Embedding you could use nn.Linear. The only change needed would be that inputs now has to be a list of one-hot vectors. But I wouldn’t bother; nn.Embedding keeps things simpler.

No, I don’t want to implement CBOW. I just want to know which architecture corresponds to nn.Embedding, and whether I understand it correctly. So, it looks like nn.Embedding is equal to the red square part in the picture.

nn.Embedding is not an architecture; it’s a simple layer at best. In fact, it’s a linear layer, just with a specific use.

Internally, nn.Embedding is, like a linear layer, an M x N matrix, with M being the number of words and N being the size of each word vector. There’s nothing more to it. It just maps a word (specified by an index) to the corresponding word vector, i.e., the corresponding row in the matrix.
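To make the shapes concrete, here is a small sketch with M = 1000 and N = 30 (numbers taken from the original question; the index values are arbitrary):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=1000, embedding_dim=30)  # M = 1000, N = 30
print(emb.weight.shape)   # torch.Size([1000, 30])

batch = torch.tensor([[2, 5, 5], [0, 999, 2]])  # any integer tensor of indices
out = emb(batch)                                # shape: (2, 3, 30)

# The layer's output is plain row indexing into the weight matrix.
rows_match = torch.equal(out, emb.weight[batch])
```

The output simply has the shape of the index tensor with one extra dimension of size N appended.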


Does PyTorch treat backpropagation of (1-hot input to linear layer) the same as (index selection of embedding)?

I’m guessing that PyTorch will calculate the gradient for all entries of the linear layer, and all but one row will be zero given the one-hot input (i.e., lots of computation for a large linear layer). Or does PyTorch optimize this out?

Will PyTorch do the same for an embedding, or will it initialize and backpropagate only to the embedding rows that were indexed?

Assuming limited GPU memory and large CPU memory, do both share the same minimal (> zero) amount of data that can be sent to the GPU?


@John_Grabner To be honest, I have no idea. Maybe someone with a deeper knowledge of the implementation will reply to this.

If I had to guess, I would say that this is optimized. After all, this is kind of the purpose of an embedding layer compared to a general linear layer: exploiting the knowledge that just one index/row is affected, or multiple indices/rows in the case of multiple inputs.

Just don’t quote me on this :)
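For what it’s worth, the gradient values themselves are easy to inspect. The sketch below (tiny made-up sizes, bias-free Linear with copied weights) only shows what the gradients look like in the default dense mode; it says nothing about how much computation each backward pass does internally.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, dim = 10, 4
idx = torch.tensor([2, 7])

emb = nn.Embedding(vocab_size, dim)
lin = nn.Linear(vocab_size, dim, bias=False)
with torch.no_grad():
    lin.weight.copy_(emb.weight.T)

emb(idx).sum().backward()
lin(F.one_hot(idx, vocab_size).float()).sum().backward()

# Default (dense) mode: the gradient has the full weight shape,
# but only the looked-up rows are non-zero.
touched_rows = emb.weight.grad.abs().sum(dim=1).nonzero().flatten()

# Both layers receive the same gradient values (up to the transpose).
grads_match = torch.allclose(emb.weight.grad, lin.weight.grad.T)
```

So in terms of results the two are interchangeable; any difference would be in how the backward pass gets there.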

Sparse gradient mode can be enabled for nn.Embedding (sparse=True). With it, the element-wise mean and variance estimates of the gradient are updated correctly (for specific optimizers), but this may not reduce peak memory, as that gradient tensor is short-lived (due to reverse-mode backprop). Instead, it is possible to not move nn.Embedding to the GPU at all, and to move just the lookup results in forward().
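A minimal sketch of both ideas, assuming SparseAdam as one of the optimizers that accept sparse gradients (the table size, indices and loss are arbitrary):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10_000, 64, sparse=True)  # gradient w.r.t. weight becomes a sparse tensor
optimizer = torch.optim.SparseAdam(emb.parameters(), lr=1e-3)

idx = torch.tensor([3, 42, 42, 9_999])
untouched_before = emb.weight[0].detach().clone()  # row 0 is never looked up

loss = emb(idx).pow(2).sum()
loss.backward()

grad_is_sparse = emb.weight.grad.is_sparse  # True when sparse=True
optimizer.step()                            # only the looked-up rows are updated

# Keeping the table on the CPU and moving only the looked-up vectors:
device = "cuda" if torch.cuda.is_available() else "cpu"
vectors = emb(idx).to(device)               # shape: (4, 64)
```

The last two lines are the "don’t move the embedding" variant: the full table stays in CPU memory and only a (4, 64) tensor is transferred.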

Transformers most often take as input the sum of some input representation and a position embedding.
For example, positions 1 to 128 are represented as torch.nn.Embedding(num_embeddings=128, ...).
I never see torch.nn.Linear used to project a float position to an embedding. Nor do I see the sparse flag set for the embedding.

If they (Linear and Embedding) are essentially the same, I would assume some people would choose the linear projection (cleaner in my mind when the embedding is for position).
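One caveat worth sketching here: the equivalence discussed earlier in the thread holds for one-hot inputs. A Linear layer applied to a single float position is a different thing, since it can only produce p * w + b. The sizes and names below are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
max_len, dim = 128, 16

pos_emb = nn.Embedding(max_len, dim)  # one free vector per position
pos_lin = nn.Linear(1, dim)           # maps a scalar position p to p * w + b

positions = torch.arange(3)                       # tensor([0, 1, 2])
e = pos_emb(positions)                            # (3, 16), rows are independent
l = pos_lin(positions.float().unsqueeze(-1))      # (3, 16), rows lie on a straight line

# For the linear projection, consecutive positions always differ by the same vector.
linear_steps_equal = torch.allclose(l[1] - l[0], l[2] - l[1], atol=1e-5)
embedding_steps_equal = torch.allclose(e[1] - e[0], e[2] - e[1], atol=1e-5)
```

So a float-position projection is much more constrained than a position lookup table, which may be one reason it is rarely seen.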

Outside of AI and backpropagation, a lookup can be implemented much more efficiently than multiplying an array by a mask, especially if the table is large.

I believe BERT’s use of transformers involves a very large embedding table (52K) to represent words, in addition to embeddings for word position.

I scavenged the PyTorch GitHub repo and found Embedding.cpp in the call path of nn.Embedding. I have no idea how this code does its magic, but embedding_dense_backward_cpu has a bunch of if statements before adding grad_weights, while Linear.cpp does a multiplication.

So I’m guessing that an embedding is much faster than a linear layer in backpropagation, especially when large embeddings are used. With small embeddings, they are essentially the same.

Hoping someone who understands the PyTorch implementation can say for sure.

Hi Chris,

I’m just trying to make the connection between nn.Embedding and nn.Linear. I think I understand what an embedding is: A representation of the input in a different vector space. Would you mind clarifying this point:

Now giving such a vector v with v[2]=1 (cf. example vector above) to the Linear layer gives you simply the 2nd row of that layer.

I think I’ve made embeddings before by training an LSTM autoencoder to reconstruct the input sequence from the final hidden state. I thought the embedding would then be the hidden state of the encoder.

I’ve only ever done time-series related work with LSTMs, but am trying to learn their NLP applications and can’t find a word written on NLP that doesn’t include the use of an nn.Embedding layer.

When you do time series analysis, your input is presumably already numerical, so the notion of an embedding is not an issue there.

In NLP you typically deal with words, i.e., non-numerical input. So you have to encode your sentences, paragraphs, documents, etc. somehow for the model to “understand” them. The naive approach would be one-hot encoding, where each word is a vector of the size of your vocabulary with only a single 1 at the index of the corresponding word. For example, if the word “hello” has index 42 in your vocabulary, the one-hot encoded word vector for “hello” would look like:

[0 0 0 0 0 ... 0 0 1 0 ... 0 0 0 0 0 0 ]

with the whole vector being vocab_size long, and the 1 at index 42. The problem is that for many NLP applications, your vocabulary can be very large, e.g., way beyond 100,000 words (but let’s stick with 100k). Now, if your input document has, say, 100 words, then the input for your model is a matrix M with M.shape = (100, 100000). This is annoyingly large…and unnecessarily so, as we will see in a bit…

The first layer getting this input is typically a Linear layer with a weight matrix E of, say, E.shape = (100000, 300) to reduce the dimensionality. If you do the calculations, you will notice that M.dot(E) does nothing more than select the rows in E that correspond to the indices of the 1’s in M (because M is one-hot encoded). If this is unclear, I recommend doing this with a toy example on paper.

Knowing this, there’s no need for the one-hot encoded matrix M anymore; we only need the indices of the words. nn.Embedding is more or less just a linear layer that performs the M.dot(E) without the need for the large matrix M. This is why you don’t see any example in NLP without an explicit embedding layer (an exception might be character-based models where your vocabulary is small, e.g., < 100 characters).
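The toy example can also be done in a few lines of code. This is a sketch with the sizes scaled down from 100,000 x 300 to 1,000 x 30 and a made-up 4-word document:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, emb_dim = 1000, 30   # scaled down from the 100,000 x 300 in the text
E = torch.randn(vocab_size, emb_dim)

doc = torch.tensor([42, 7, 42, 999])                 # word indices of a 4-word document
M = F.one_hot(doc, num_classes=vocab_size).float()   # one-hot matrix, shape (4, 1000)

via_matmul = M @ E    # (4, 1000) @ (1000, 30) -> (4, 30)
via_lookup = E[doc]   # plain row selection, no one-hot matrix needed

same = torch.allclose(via_matmul, via_lookup)
print(M.numel(), doc.numel())   # 4000 stored values vs. 4 indices
```

The gap between the two numbers printed at the end grows with the vocabulary size, which is the whole point of passing indices instead of M.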


Ah I see what you are saying! M acts as a sort of permutation matrix on E. I did it on paper and see exactly what you mean now. That also brings clarity to something I’ve often read, that nn.Embedding acts as a sort of lookup table. Okay, great! This has been a big question for me for a while now. Thank you!

Yes, multiplying a one-hot vector with the embedding matrix basically just makes a lookup.

I also assume(!) that nn.Embedding does backpropagation in a smarter way, since it knows which rows of the matrix are the only ones that can be affected. But I don’t really know about the inner workings.

Can you show your working on paper, or make a picture of it? I am still not very sure about this, especially the size parameters:
nn.Embedding(num_users, emb_size)

So the idea is that a linear layer is simply a matrix multiplication of the inputs by the weights defined by the layer:

y = x * A.T + b

Where A is the weight matrix of the linear layer.

Now, imagine that x is a one-hot encoded vector representing a single word in your vocabulary. This vector is all zeros with a single one. Then you can ask yourself: what is the output of the multiplication x * A.T? The result is a single row of A.T (equivalently, a single column of A). If this doesn’t make sense, try it yourself on paper: multiply a one-hot vector by any dense matrix.

After you have confirmed that what I say is true we can imagine next steps:

For each one-hot vector x_i (one per word in the vocabulary):
1. Compute y_i = x_i * A.T
2. Stack the y_i for i = 1 ... vocab_size as row vectors into a new matrix E

Now, this matrix E serves two purposes. It is both a linear transformation whose weights can be updated/learned AND a look-up table for your vocabulary.
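As a worked toy case of the steps above, take a vocabulary of 3 words and 2-dimensional vectors, and assume the weight A is stored as (outputs x inputs), the way nn.Linear stores it. The bias b is left out, and the entries a_ij are just placeholders:

```latex
A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{pmatrix},
\qquad
A^{\top} = \begin{pmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \\ a_{13} & a_{23} \end{pmatrix}

x_2 = \begin{pmatrix} 0 & 1 & 0 \end{pmatrix}
\quad\Longrightarrow\quad
x_2 A^{\top} = \begin{pmatrix} a_{12} & a_{22} \end{pmatrix}

E = \begin{pmatrix} x_1 A^{\top} \\ x_2 A^{\top} \\ x_3 A^{\top} \end{pmatrix} = I_3 A^{\top} = A^{\top}
```

Under that storage convention, the one-hot vector x_2 picks out the second row of A^T (the second column of A), and stacking the results for all words just gives E = A^T.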

I hope that makes sense!


Thank you for your instructive answer. I’m just wondering if nn.Embedding(…) includes a mechanism to avoid duplicate representations for different words (i.e. having duplicate rows in its parameter matrix).

It does not. Implementation is here. However, the weights are usually (it’s up to you) initialized randomly, and the probability of a collision is theoretically 0. In practice, since machines are digital, the probability of a collision is not actually zero, but it is effectively zero for all intents and purposes.
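If you want to check this for a given layer, counting the distinct rows is a quick way to do it. This is a small sketch with the default random initialization; the manual overwrite at the end is only there to show that nothing stops duplicates from being created.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(1000, 30)   # rows are initialized from N(0, 1) by default

# Count distinct rows; equal to num_embeddings when no two words share a vector.
num_unique = torch.unique(emb.weight.detach(), dim=0).shape[0]

# Nothing prevents duplicates if you create them yourself:
with torch.no_grad():
    emb.weight[1] = emb.weight[0]
num_unique_after = torch.unique(emb.weight.detach(), dim=0).shape[0]
```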