How to implement skip-gram or CBOW in PyTorch

#1

I just learned about word embeddings, and I understand that word vectors can be learned with the CBOW or skip-gram procedure. I have two questions about word embeddings in PyTorch.

The first one: how to understand nn.Embedding in PyTorch

I don’t think I have a good understanding of nn.Embedding in PyTorch. Does nn.Embedding do the same thing as nn.Linear? To me, nn.Embedding looks just like a shallow fully connected network.

And if not, how are the weights of nn.Embedding updated during training?


The second one: how to implement skip-gram (or CBOW) in PyTorch

Second, I want to know how to implement skip-gram (or CBOW) in PyTorch. Do the following two networks implement CBOW and skip-gram correctly? (The weight of nn.Embedding is the word vector.)

import torch.nn as nn
import torch.nn.functional as F


class CBOWModeler(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(CBOWModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, x):
        # x: LongTensor of context_size word indices (a single example)
        embeds = self.embeddings(x).view(1, -1)  # concatenate the context vectors
        output = self.linear1(embeds)
        output = F.relu(output)
        output = self.linear2(output)
        log_probs = F.log_softmax(output, dim=1)  # shape (1, vocab_size)
        return log_probs


class SkipgramModeler(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(SkipgramModeler, self).__init__()
        self.context_size = context_size
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.linear2 = nn.Linear(128, context_size * vocab_size)

    def forward(self, x):
        # x: LongTensor holding one center-word index
        embeds = self.embeddings(x).view(1, -1)
        output = self.linear1(embeds)
        output = F.relu(output)
        output = self.linear2(output)
        # Reshape first, so each context position gets its own
        # distribution over the vocabulary: shape (context_size, vocab_size)
        log_probs = F.log_softmax(output.view(self.context_size, -1), dim=1)
        return log_probs

So, if I want to use embeddings in an NLP task, should I first train a model like the ones above to obtain the embedding weights?

Thanks for replying.

(n0obcoder) #2
  1. Treat nn.Embedding as a lookup table where the key is the word index and the value is the corresponding word vector. Before using it, you have to specify the size of the lookup table (the vocabulary size and the vector dimension), which also initializes the word vectors (randomly by default).

    Not all of the weights in nn.Embedding are trained at the same time. Which weights get trained depends on your training pairs. For example, let's say (‘Bruce’, ‘Wayne’) is a training pair. Assuming the words ‘Bruce’ and ‘Wayne’ are in your vocabulary with indices 100 and 200 (just an example), nn.Embedding lets you pick out the untrained word vectors at these two indices. The training objective then brings these two vectors closer to each other, which is how they get trained.

    Remember, nn.Embedding is a lookup table. You just give it the indices of the words, and it gives you the word vectors for those words.

  2. You can have a look at this PyTorch implementation of the skip-gram model.
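
The lookup and the "brought closer" step from point 1 can be sketched as follows. This is only a minimal illustration, not the linked implementation: the vocabulary size of 1000, the embedding dimension of 8, the single shared embedding table and the log-sigmoid dot-product loss are all assumptions made for the example (word2vec proper uses separate input and output tables plus negative samples).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Assumed sizes, for illustration only.
vocab_size, embedding_dim = 1000, 8
emb = nn.Embedding(vocab_size, embedding_dim)  # rows start out random

# Hypothetical indices for 'Bruce' and 'Wayne', as in the post above.
pair = torch.tensor([100, 200])

# Lookup: calling the module returns the rows of emb.weight at those indices.
vectors = emb(pair)
lookup_matches_rows = torch.equal(vectors, emb.weight[pair])

# Remember an unrelated row and the pair's dot product before training.
untouched_before = emb.weight[300].detach().clone()
dot_before = torch.dot(emb.weight[100], emb.weight[200]).item()

optimizer = torch.optim.SGD(emb.parameters(), lr=0.1)
for _ in range(20):
    optimizer.zero_grad()
    bruce, wayne = emb(pair)  # two vectors of shape (embedding_dim,)
    # The loss gets smaller as the dot product of the pair gets larger.
    loss = -F.logsigmoid(torch.dot(bruce, wayne))
    loss.backward()
    optimizer.step()

dot_after = torch.dot(emb.weight[100], emb.weight[200]).item()
untouched_after = emb.weight[300].detach().clone()

print(lookup_matches_rows)                             # lookup == row selection
print(dot_after > dot_before)                          # the pair moved closer
print(torch.equal(untouched_before, untouched_after))  # other rows unchanged
```

Only rows 100 and 200 receive updates here; every other row of the table stays exactly as it was initialized.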

I hope it helped you !!! :slight_smile:

#3

Thanks for your reply.

About the first question, I still have some doubts. As you said:

For example, let's say (‘Bruce’, ‘Wayne’) is a training pair. Assuming the words ‘Bruce’ and ‘Wayne’ are in your vocabulary with indices 100 and 200 (just an example), nn.Embedding lets you pick out the untrained word vectors at these two indices. The training objective then brings these two vectors closer to each other, which is how they get trained.

How are these two vectors brought closer to each other? The following are some of my thoughts, and I don’t know whether they are right.


To make this problem easy to explain, assume there are only two words in total, so word2index is {'Bruce': 0, 'Wayne': 1}. And assume the initial weight of nn.Embedding is [[0.1, 0.2, 0.3], [0.2, 0.5, 0.6]] (these are random numbers).

When we input the word ‘Wayne’ (index 1) to nn.Embedding, is that equivalent to a matrix multiplication, as shown in the picture below?

[Image: Snipaste_2019-06-12_17-45-31]

In this way, the network can back-propagate to modify the weights of nn.Embedding. So I think it is very similar to nn.Linear.

(n0obcoder) #4

No, you don't need to do a matrix multiplication to get the word embedding for the word ‘Wayne’. All you need to do is pass the index of ‘Wayne’ (index = 1), and nn.Embedding returns row 1 of its weight matrix.

nn.Linear (without a bias) applied to the one-hot vector would do the matrix multiplication, and the result would be the same. But that would be computationally expensive. This is where nn.Embedding comes into the picture: all it needs is the indices of the words, and it returns the word vectors of those words. It's just a lookup table, so less computation is required.
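
The equivalence can be checked on the two-word example from post #3. This is a small sketch: it loads the example weights with nn.Embedding.from_pretrained and builds the one-hot vector with F.one_hot; the variable names are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Weights from the two-word example: row 0 is 'Bruce', row 1 is 'Wayne'.
weight = torch.tensor([[0.1, 0.2, 0.3],
                       [0.2, 0.5, 0.6]])
emb = nn.Embedding.from_pretrained(weight, freeze=False)

wayne = torch.tensor([1])

# Lookup: simply select row 1 of the weight matrix.
by_lookup = emb(wayne)

# Matrix multiplication: one-hot row vector times the weight matrix.
one_hot = F.one_hot(wayne, num_classes=2).float()  # [[0., 1.]]
by_matmul = one_hot @ emb.weight

same = torch.allclose(by_lookup, by_matmul)
print(by_lookup)
print(by_matmul)
print(same)
```

Both paths return the row for ‘Wayne’, but the matrix product multiplies and adds over every word in the vocabulary, while the lookup reads a single row.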

#5

Maybe I didn’t make my question clear.

I know we can just pass the index of a word to nn.Embedding and get the word vector back.

But if nn.Embedding is just a lookup table, how can it use the backpropagation algorithm to optimize its weights (if it doesn’t do a matrix multiplication)?

(Dhananjay Raut) #6

That’s exactly why you use nn.Embedding: so that backpropagation can happen. It is handled in the library.

(n0obcoder) #7

In nn.Embedding, backpropagation doesn't update the entire matrix. Gradients flow only to the rows of the embedding matrix whose indices were passed in.
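
A quick way to see this is to inspect the gradient after a backward pass. The sizes (5 words, 3 dimensions) and the sum-of-entries loss below are placeholders chosen only for the sketch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

emb = nn.Embedding(5, 3)        # 5 words, 3-dimensional vectors (assumed sizes)
indices = torch.tensor([1, 3])  # only these two words appear in the batch

loss = emb(indices).sum()       # placeholder loss, just to produce gradients
loss.backward()

grad = emb.weight.grad          # same shape as the weight matrix: (5, 3)
rows_with_grad = grad.abs().sum(dim=1).nonzero().flatten().tolist()

print(grad)
print(rows_with_grad)           # indices of the rows with a nonzero gradient
```

With the default settings the gradient tensor still has the full shape of the weight matrix, but the rows that were not looked up are all zeros. nn.Embedding also accepts sparse=True, which makes the gradient a sparse tensor instead.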

#9

Thanks for your patient explanation. :smiley: I think I understand it now.

I now think of nn.Embedding as an optimized index-to-vector lookup. We could use nn.Linear to get the same result, but nn.Embedding needs less computation because it skips the matrix multiplication and backpropagation doesn’t touch the entire matrix.

(n0obcoder) #10

It's always good to see someone satisfied with your explanation :smiley: