How to implement skip-gram or CBOW in pytorch

WMN7 · June 11, 2019, 1:46pm

I just learned about word embeddings, and I understand that word vectors can be learned with the CBOW or skip-gram procedure. I have two questions about word embeddings in PyTorch.

The first one: how to understand nn.Embedding in PyTorch

I don’t think I have a good understanding of nn.Embedding in PyTorch. Does nn.Embedding do the same thing as nn.Linear? It seems to me that nn.Embedding is just like a shallow fully connected network.

And if not, how are the weights of nn.Embedding fine-tuned during training?

The second one: how to implement skip-gram (or CBOW) in PyTorch

For the second question, I want to know how to implement skip-gram (or CBOW) in PyTorch. Do the following two networks correctly implement CBOW and skip-gram? (The weight of nn.Embedding is the word vector.)

import torch
import torch.nn as nn
import torch.nn.functional as F

class CBOWModeler(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(CBOWModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size*embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)
        
    def forward(self, x):
        embeds = self.embeddings(x).view(1,-1)
        output = self.linear1(embeds)
        output = F.relu(output)
        output = self.linear2(output)
        log_probs = F.log_softmax(output, dim=1)
        return log_probs

class SkipgramModeler(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(SkipgramModeler, self).__init__()
        self.context_size = context_size
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.linear2 = nn.Linear(128, context_size*vocab_size)
    
    def forward(self,x):
        embeds = self.embeddings(x).view(1,-1)
        output = self.linear1(embeds)
        output = F.relu(output)
        output = self.linear2(output)
        # Reshape first so softmax normalizes over the vocabulary for each context position
        log_probs = F.log_softmax(output.view(self.context_size, -1), dim=1)
        return log_probs

So, if I want to use an embedding in an NLP task, should I first train a model like the ones above to obtain the embedding weights?

Thanks for replying.

n0obcoder · June 12, 2019, 9:06am

Treat nn.Embedding as a lookup table where the key is the word index and the value is the corresponding word vector. Before using it, you specify the size of the lookup table (number of words and vector dimension); the word vectors are randomly initialized by default.

Not all the weights in nn.Embedding are trained at the same time. Which weights get updated depends on your training pairs. For example, let's say (‘Bruce’, ‘Wayne’) is a training pair. Assuming that ‘Bruce’ and ‘Wayne’ are present in your vocabulary with indices 100 and 200 (just an example), nn.Embedding lets you pick out the untrained word vectors for these two indices. Training then brings these two vectors closer to each other, which is how they get trained.

Remember, nn.Embedding is a lookup table. You just give it the indices of the words, and it gives you the word vectors for those words.
You can have a look at this PyTorch implementation of the skip-gram model.
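To make the "brought closer" idea concrete, here is a minimal sketch, not a full skip-gram trainer. It assumes a single embedding table and uses only the positive-pair term of a negative-sampling-style loss (both are simplifications I picked for illustration; real word2vec uses separate input and output vectors plus negative samples). The vocabulary size of 300 and the dimension of 8 are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

emb = nn.Embedding(300, 8)  # hypothetical vocabulary of 300 words, 8-dim vectors
optimizer = torch.optim.SGD(emb.parameters(), lr=0.1)

bruce = torch.tensor([100])  # index of 'Bruce' (example value from the post)
wayne = torch.tensor([200])  # index of 'Wayne'

def similarity():
    # Dot product between the two word vectors, without tracking gradients.
    with torch.no_grad():
        return (emb(bruce) * emb(wayne)).sum().item()

before = similarity()
untouched_before = emb.weight[0].detach().clone()  # a row that is never looked up

for _ in range(20):
    optimizer.zero_grad()
    score = (emb(bruce) * emb(wayne)).sum()
    loss = -F.logsigmoid(score)  # assumption: positive-pair term only
    loss.backward()
    optimizer.step()

after = similarity()
print(before, after)  # the dot product grows as the pair is trained
```

Only rows 100 and 200 move here; a row such as index 0 that never appears in a training pair keeps its initial values.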

I hope this helps!

WMN7 · June 12, 2019, 9:49am

Thanks for your reply.

About the first question, I still have some doubts. As you said:

For example, let's say (‘Bruce’, ‘Wayne’) is a training pair. Assuming that ‘Bruce’ and ‘Wayne’ are present in your vocabulary with indices 100 and 200 (just an example), nn.Embedding lets you pick out the untrained word vectors for these two indices. Training then brings these two vectors closer to each other, which is how they get trained.

How do these two vectors get brought closer to each other? Here are some of my thoughts; I don’t know whether they are right.

To keep the problem easy to explain, assume there are only two words in total, so word2index is {'Bruce': 0, 'Wayne': 1}. And assume the initial weight of nn.Embedding is [[0.1, 0.2, 0.3], [0.2, 0.5, 0.6]] (random numbers).

When we input the word ‘Wayne’ (index 1) to nn.Embedding, is that equivalent to a matrix multiplication, as shown in the picture below?

[Image: Snipaste_2019-06-12_17-45-31]

In this way, the network can backpropagate to modify the weights of nn.Embedding. So I think it is very similar to nn.Linear.

n0obcoder · June 12, 2019, 10:16am

No, you don't need a matrix multiplication to get the word embedding for ‘Wayne’. All you need to do is pass the index of ‘Wayne’ (index = 1), and nn.Embedding gets you row 1.

nn.Linear (without a bias) applied to a one-hot vector would do a matrix multiplication and give the same result, but that is computationally expensive. This is where nn.Embedding comes into the picture. All it needs is the indices of the words, and it returns the word vectors for those words. It's just a lookup table, so less computation is required.
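A small sketch of that equivalence, using the two-word vocabulary and example weights from the question. The one-hot route is written as a plain matrix product with the same weight matrix rather than an actual nn.Linear layer, to keep it short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embedding_dim = 2, 3
emb = nn.Embedding(vocab_size, embedding_dim)

# Overwrite the random initialization with the example weights from the thread.
with torch.no_grad():
    emb.weight.copy_(torch.tensor([[0.1, 0.2, 0.3],
                                   [0.2, 0.5, 0.6]]))

idx = torch.tensor([1])  # index of 'Wayne'

# 1) Lookup: nn.Embedding simply selects row 1 of the weight matrix.
looked_up = emb(idx)

# 2) Matrix multiplication: a one-hot vector times the same weight matrix.
one_hot = F.one_hot(idx, num_classes=vocab_size).float()  # [[0., 1.]]
multiplied = one_hot @ emb.weight

print(looked_up)
print(multiplied)
print(torch.allclose(looked_up, multiplied))  # True
```

Both routes return row 1, but the lookup never builds the one-hot vector or multiplies by the zeros in it, which matters once the vocabulary has tens of thousands of words.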

WMN7 · June 12, 2019, 10:29am

Maybe I didn’t make my question clear.

I know we can just pass the index of a word to nn.Embedding and get the word vector back.

But if nn.Embedding is just a lookup table, how can it use the BP algorithm to optimize its weights (if it doesn’t do a matrix multiplication)?

dhananjayraut · June 12, 2019, 10:33am

That’s exactly why you use nn.Embedding: so that backpropagation can happen. It is handled in the library.

n0obcoder · June 12, 2019, 10:37am

In nn.Embedding, backpropagation doesn't happen on the entire matrix. Only the rows of the embedding matrix whose indices are passed receive a gradient.
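A quick way to check this is to look at the gradient after a backward pass. This is a minimal sketch with a made-up 5-word vocabulary and a toy loss (the sum of the looked-up vectors).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

emb = nn.Embedding(5, 3)      # 5 words, 3-dimensional vectors (made-up sizes)
idx = torch.tensor([1, 3])    # only rows 1 and 3 are looked up

loss = emb(idx).sum()         # toy loss, just to have something to backpropagate
loss.backward()

print(emb.weight.grad)        # rows 1 and 3 are nonzero, the rest are all zeros

# Boolean mask of the rows that actually received a gradient.
touched = emb.weight.grad.abs().sum(dim=1) > 0
print(touched)
```

By default the gradient tensor still has the full shape of the weight matrix, with zeros in the untouched rows. Passing sparse=True to nn.Embedding makes the gradient a sparse tensor instead.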

WMN7 · June 12, 2019, 11:06am

Thanks for your patient explanation. I think I understand it now.

I now think nn.Embedding is optimized for embedding lookup (index to vector). We can use nn.Linear to achieve the same result, but nn.Embedding needs less computation because backpropagation doesn’t happen on the entire matrix.

n0obcoder · June 12, 2019, 11:47am

It's always good to see someone satisfied with your explanation.

Fei_Gao · December 3, 2020, 11:15am

Hello, thanks for your kind explanation. I now understand the difference between nn.Embedding and nn.Linear, but I cannot find the implementation of BP in the PyTorch code, /home/linux/miniconda3/lib/python3.7/site-packages/torch/nn/modules/sparse.py.
I found the F.embedding function in /home/linux/miniconda3/lib/python3.7/site-packages/torch/nn/functional.pyi, which is only a function definition:

def embedding(
    input: Tensor, weight: Tensor, padding_idx: Optional[int] = ...,
    max_norm: Optional[float] = ..., norm_type: float = ...,
    scale_grad_by_freq: bool = ..., sparse: bool = ...) -> Tensor: ...
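For context, a .pyi file is only a type stub. As far as I know, the forward and backward of the embedding op live in PyTorch's compiled backend and are hooked into autograd, so there is no Python backward function to find in sparse.py. A small sketch that shows the autograd node attached to the output:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(5, 3)  # made-up sizes
out = emb(torch.tensor([1, 3]))

# The output carries a grad_fn node; autograd calls its backward during loss.backward().
# The exact class name printed here may differ between PyTorch versions.
print(out.grad_fn)
print(out.requires_grad)
```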

jmaronas · March 4, 2025, 2:15pm

From what I can see in your blog, I think your implementation of the skip-gram model is not very precise, since it starts from one-hot vectors rather than randomized vectors.

In the paper it says: More precisely, we use each current word as an input to a log-linear classifier with continuous projection layer.

So, from what I understand of the paper's equations, you basically take a random dense embedding matrix W and use it as the linear projection to the softmax output when it is multiplied by the input vector.
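A minimal sketch of that reading, assuming a full softmax over a tiny made-up vocabulary and no hidden layer or nonlinearity. The class name LogLinearSkipGram and the toy (center, context) pairs are hypothetical, and the paper's efficiency tricks (hierarchical softmax, negative sampling) are left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LogLinearSkipGram(nn.Module):
    """Hypothetical sketch: center word -> projection -> softmax over the vocabulary."""

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # Randomly initialized dense projection matrix W (one row per word).
        self.in_embed = nn.Embedding(vocab_size, embedding_dim)
        # Output weights mapping the projected vector to vocabulary scores.
        self.out_proj = nn.Linear(embedding_dim, vocab_size, bias=False)

    def forward(self, center):                # center: (batch,)
        hidden = self.in_embed(center)        # (batch, embedding_dim)
        scores = self.out_proj(hidden)        # (batch, vocab_size)
        return F.log_softmax(scores, dim=1)   # log-probabilities per center word

torch.manual_seed(0)
model = LogLinearSkipGram(vocab_size=10, embedding_dim=4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Toy (center, context) index pairs, invented for illustration.
centers = torch.tensor([1, 1, 2, 2])
contexts = torch.tensor([0, 2, 1, 3])

losses = []
for _ in range(100):
    optimizer.zero_grad()
    loss = F.nll_loss(model(centers), contexts)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])  # the loss should go down over training
```

Looking up row i of in_embed gives the same vector as multiplying a one-hot vector for word i by W, so the one-hot description and the lookup description refer to the same model. After training, in_embed.weight holds the word vectors.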