How should I understand the num_embeddings and embedding_dim arguments for nn.Embedding?

Hello. I’m aware that this question (and many similar ones) has already been asked on this forum and Stack Overflow, but I’m still having trouble grasping how the concept works and wanted to ask a question based on a specific toy example that I went through.

I’m aware that the num_embeddings argument refers to how many elements we have in our vocabulary, and embedding_dim simply specifies how many dimensions each embedding vector should have.

The specific code that I tried is as follows:

import torch
import torch.nn as nn


embedding = nn.Embedding(num_embeddings=10, embedding_dim=3)

a = torch.LongTensor([[1, 2, 3, 4], [4, 3, 2, 1]]) # (2, 4)

b = torch.LongTensor([[1, 2, 3], [2, 3, 1], [4, 5, 6], [3, 3, 3], [2, 1, 2],
                      [6, 7, 8], [2, 5, 2], [3, 5, 8], [2, 3, 6], [8, 9, 6],
                      [2, 6, 3], [6, 5, 4], [2, 6, 5]]) # (13, 3)

c = torch.LongTensor([[1, 2, 3, 2, 1, 2, 3, 3, 3, 3, 3],
                      [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]]) # (2, 11)

If I run a, b, and c through embedding, then I get embedding tensors of shape (2, 4, 3), (13, 3, 3), and (2, 11, 3), respectively.

My question here is: shouldn’t b give me an index out of range error, since it’s a tensor consisting of 13 words, each of dimension 3, and hence falls outside the predefined range of 10?

Any tips or pointers are appreciated. Thanks in advance.


I think when you do

embedding = nn.Embedding(num_embeddings=10, embedding_dim=3)

then it means that you have 10 words and each of those words is represented by an embedding of size 3. For example, if you have words like

hello
world

and so on, then each of these would be represented by 3 numbers. One example would be:

hello -> [0.01 0.2 0.5]
world -> [0.04 0.6 0.7]

and so on. If you do

list(embedding.parameters())

then you will get something like this,

[Parameter containing:
 tensor([[ 0.9227,  0.6492, -1.1440],
         [ 1.5318, -0.2873, -0.7290],
         [-0.4234, -1.7012, -0.9684],
         [-0.2859,  1.4677, -1.4499],
         [-1.8966, -1.4591,  0.5218],
         [ 2.4023, -1.5395, -0.7947],
         [-0.0464,  0.7174, -0.7452],
         [ 0.9500, -0.4633,  0.5398],
         [ 0.3458, -0.7997,  0.8895],
         [-0.3303, -0.5663, -0.2300]], requires_grad=True)]

which shows how each of these words is represented.

when you do,

a = torch.LongTensor([[1, 2, 3, 4], [4, 3, 2, 1]]) # (2, 4)

and then

embedding(a).shape

it gives

torch.Size([2, 4, 3])

while

embedding(a)

gives

tensor([[[ 1.5318, -0.2873, -0.7290],
         [-0.4234, -1.7012, -0.9684],
         [-0.2859,  1.4677, -1.4499],
         [-1.8966, -1.4591,  0.5218]],

        [[-1.8966, -1.4591,  0.5218],
         [-0.2859,  1.4677, -1.4499],
         [-0.4234, -1.7012, -0.9684],
         [ 1.5318, -0.2873, -0.7290]]], grad_fn=<EmbeddingBackward>)

because you are retrieving the embeddings of those words. In other words, you are asking: give me the embedding of the word at index 1, give me the embedding of the word at index 2, and so on. So it gives you the embeddings of the words at the indices that you asked for.
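To make that concrete, here is a minimal sketch (reusing the embedding and a from above; variable names are just for illustration) showing that the forward pass is plain row indexing into the weight matrix:

# nn.Embedding's forward pass is equivalent to indexing its weight matrix
# with the tensor of indices.
lookup = embedding(a)            # shape (2, 4, 3)
manual = embedding.weight[a]     # same values, via plain tensor indexing

print(torch.equal(lookup, manual))  # True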

when you do

b = torch.LongTensor([[1, 2, 3], [2, 3, 1], [4, 5, 6], [3, 3, 3], [2, 1, 2],
                      [6, 7, 8], [2, 5, 2], [3, 5, 8], [2, 3, 6], [8, 9, 6],
                      [2, 6, 3], [6, 5, 4], [2, 6, 5]]) # (13, 3)
embedding(b)

then it means: give me the embedding of the word at index 1, then the word at index 2, then 3, then 2, then 3, then 1, and so on.

Here, ‘a’ and ‘b’ contain the indices of the words you want to retrieve the embeddings for.
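To connect those indices back to actual words, here is a small sketch; the word-to-index mapping is purely hypothetical and would normally come from your tokenizer or vocabulary:

# Hypothetical vocabulary: each word gets an integer index < num_embeddings
vocab = {"hello": 0, "world": 1, "foo": 2, "bar": 3}

sentence = ["hello", "world", "hello"]
indices = torch.LongTensor([vocab[w] for w in sentence])  # tensor([0, 1, 0])

vectors = embedding(indices)   # one 3-dim vector per word
print(vectors.shape)           # torch.Size([3, 3])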


The values in a must all be valid indices, i.e. between 0 and num_embeddings - 1:

num_embeddings = 10  # same value used when constructing the embedding above

for i in range(num_embeddings + 4):
    try:
        embedding(torch.arange(i))  # looks up indices 0 .. i-1
    except IndexError:
        print(f"failed for i={i}")  # fails for i = 11, 12, 13, once an index >= num_embeddings appears

The input data can be any shape, as long as every value is a valid index:

try:
    # random valid indices in [0, num_embeddings) with an arbitrary 5-d shape
    embedding(torch.randint(0, num_embeddings, size=(2, 3, 4, 5, 6)))
    print("it worked")
except IndexError:
    print("this won't print because this won't fail")

Each entry in your input tensor is mapped to a vector with 3 coordinates (a 3-dimensional vector in mathematical terminology, but not in the sense of PyTorch tensor dimensions), which can be found in the last axis of the output tensor. Namely, a[i][j] is mapped to the vector embedding(a)[i][j], which is a 1-dimensional tensor with 3 components.

Interpretation of dimensions
Say, for example, your input tensor has shape (13, 3), with values between 0 and 9. If this were to represent text, then you can think of it as 13 samples of text, each containing 3 words, where each word is taken from a vocabulary of 10 words.
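As a quick sketch of that interpretation (random indices stand in for words from the 10-word vocabulary):

batch = torch.randint(0, 10, size=(13, 3))  # 13 text samples, 3 word indices each
out = embedding(batch)
print(out.shape)  # torch.Size([13, 3, 3]) -> one 3-dim vector per word per sample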


Hello. Do not forget that the 13 is just a dimension. What really matters are the indices in the b tensor. Your number of embeddings is 10, so the values of your input all have to be less than 10 (i.e., in the range 0 to 9), and you have satisfied that condition in the b tensor. The 13 by 3 tensor is mapped to a 13 by 3 by 3 space. Thank you.

What is the use of an embedding, and how do people use it?

Hello vainaijr,

I agree with you. I would add that PyTorch’s tutorial specifically on word embeddings does a good job of communicating the intuition (https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html).

As you have kind of focused on the transformer architecture and have also written about a use in CV, I just want to throw in this relatively new paper, which could be interesting (https://openreview.net/forum?id=YicbFdNTTy).

Furthermore, I don’t think these embeddings (speaking about word embeddings) claim to consider order; that’s why we have positional encoding in transformers, for instance.
I guess one of the main advantages of using (word) embeddings is that we get dense vectors and also the ability to ‘compare the meanings’ of words just from the embeddings.
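To illustrate the ‘compare the meanings’ part mechanically, here is a minimal sketch; note that the embedding here is untrained and randomly initialized, so the actual similarity value is meaningless until the weights are learned:

import torch
import torch.nn as nn
import torch.nn.functional as F

emb = nn.Embedding(num_embeddings=10, embedding_dim=3)

# Cosine similarity between the embedding vectors of two word indices.
v1 = emb(torch.LongTensor([2]))[0]   # embedding of the word at index 2, shape (3,)
v2 = emb(torch.LongTensor([7]))[0]   # embedding of the word at index 7, shape (3,)
print(F.cosine_similarity(v1, v2, dim=0).item())  # a value in [-1, 1]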

Greetings,
Unity05


Thank you so much for the explanation; it clears up a lot of fog for me.
Do you by any chance know of a simple example (beyond the PyTorch tutorial) that I can look into to understand this even better?

For Question Answering tasks, can we use nn.Embedding to represent role embeddings? In other words, the agent role embedding and the user role embedding would both be trainable. The reasoning behind this is to use some embedding representation for agent and user utterances (GloVe, fastText, nn.Embedding, BERT embeddings, etc.), and we could add these trainable role embeddings to the utterance representation according to the utterance’s role, to help the model distinguish agent and user utterances.

For example, the authors of Multi-domain Dialogue State Tracking as Dynamic Knowledge Graph Enhanced Question Answering apply this idea (search for role embedding on the .pdf). I would like to do something similar with torch.
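For concreteness, here is a rough sketch of what I have in mind; the names, sizes, and the way representations are combined are just assumptions on my part, not taken from the paper:

import torch
import torch.nn as nn

# Hypothetical setup: token embeddings (could instead come from GloVe, BERT, etc.)
# plus a trainable role embedding with 2 roles: 0 = agent, 1 = user.
token_emb = nn.Embedding(num_embeddings=1000, embedding_dim=64)
role_emb = nn.Embedding(num_embeddings=2, embedding_dim=64)

tokens = torch.randint(0, 1000, size=(1, 7))   # one utterance of 7 token ids
role = torch.zeros(1, 7, dtype=torch.long)     # all tokens spoken by the agent

# Add the role embedding to every token representation of the utterance.
utterance_repr = token_emb(tokens) + role_emb(role)
print(utterance_repr.shape)  # torch.Size([1, 7, 64])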

Hi,

off the top of my head, I don’t see a reason why it should not work. I didn’t know this paper before, and I’m curious whether you’ve tried it. If yes, did it go well? ^^