[RESOLVED] Problem with skip gram model

Hi everyone,
I start using pytorch today. However, I find it’s too slow in the skip-gram model I implement. I compare it with the same program implemented with DyNet. The result shows DyNet is 1000 times faster. I don’t believe that!
So I want to ask if anyone has implemented skip-gram model that is as fast as the existing tools (e.g. word2vec).

Any suggestions are welcome!


I solved this problem. It was caused by using the momentum parameter, which forces the optimizer (SGD) to perform dense updates instead of sparse updates.

The code is here

The key to getting speed in these cases is to use sparse gradients, for example nn.Embedding(..., sparse=True). Otherwise, a full dense gradient is computed in each backward pass, which slows things down significantly.
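To make the difference concrete, here is a minimal sketch (vocabulary size and indices are illustrative): with sparse=True, backward() produces a sparse gradient containing only the rows that were actually looked up, rather than a full vocab_size x dim matrix.

```python
import torch
import torch.nn as nn

vocab_size, dim = 100_000, 128
emb = nn.Embedding(vocab_size, dim, sparse=True)

idx = torch.tensor([3, 17, 42])      # only three rows are touched
loss = emb(idx).sum()
loss.backward()

print(emb.weight.grad.is_sparse)     # True
print(emb.weight.grad._nnz())        # 3 non-zero rows, not 100,000
```

Note that only some optimizers (e.g. SGD without momentum, or SparseAdam) can consume sparse gradients.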

Thanks for your reply!
However, even though I set sparse=True, there is still a large gap between the speed of the original word2vec and my PyTorch program (30,000 pairs/s vs. 100 pairs/s). I'm wondering if there are other common mistakes that cause slow speed. Of course, I will continue looking for bugs in my program, and I will share the mistake if I find it.

Also, I find that the meaning of the "sparse" parameter is not covered in the official documentation. Maybe someone should update it.

I'm not entirely surprised; word2vec is a specialized code base for one particular model.
But it would be interesting to know why the PyTorch program is this slow. It could be either a user mistake or something slow in our core library. If it is slowness in the core library, I'm happy to speed it up.

I have implemented word2vec in pure Python and found it was not this slow, so the problem may be caused by a mistake of mine.

I suggest posting your code, in a GitHub gist for example, or profiling it yourself.


Hi, I'm sorry to reply so late.
I posted my program on GitHub: https://github.com/Adoni/naive_network_embedding.
Usage and known problems are described in the README. Any suggestions are welcome.

Hi, I posted my program here: https://github.com/Adoni/naive_network_embedding. I find that the speed drops when I call the backward() and step() functions. I have been debugging for several days but cannot find any more bugs.

Hi @ehsanmok, I have posted my code on GitHub; could you please help me check it?

There are a number of potential things that make your code very slow when you have big graphs:

  1. Your graph object is very big and you are storing the nodes and edges in lists!
  2. You are not using batched computation.
  3. negative_sampling probes a list, which is also slow for huge lists.

line_profiler will point you to where the issues are.
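On point 2 (batched computation): looking up one (center, context) pair at a time builds a tiny graph per pair, while a single batched lookup amortizes the Python and kernel-launch overhead. A rough sketch (names and sizes are illustrative, not taken from the linked repo):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10_000, 64, sparse=True)
pairs = [(1, 2), (5, 9), (7, 3)]

# Slow: one tiny forward per pair, driven from a Python loop.
scores_slow = [(emb(torch.tensor([u])) * emb(torch.tensor([v]))).sum()
               for u, v in pairs]

# Fast: one batched lookup and one vectorized reduction.
u = torch.tensor([p[0] for p in pairs])
v = torch.tensor([p[1] for p in pairs])
scores_fast = (emb(u) * emb(v)).sum(dim=1)

assert torch.allclose(torch.stack(scores_slow), scores_fast)
```

The two versions compute identical scores; only the number of Python-level calls differs.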

Thanks for your reply. I have updated my code to use batching. forward() and backward() became faster; however, optimizer.step() is still slow. :disappointed:

Hi guys, I have solved this problem. It was caused by using the momentum parameter, which forces the optimizer (SGD) to perform dense updates instead of sparse updates.
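For readers hitting the same wall, a minimal sketch of the fix (sizes are illustrative): with momentum, SGD keeps a dense per-parameter buffer, so each step touches every embedding row; plain SGD with a sparse gradient only writes the rows present in that gradient.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(100_000, 50, sparse=True)

# Slow: the momentum buffer is a full (vocab x dim) tensor, so every
# step does dense work even if the batch used only a few rows.
# opt = torch.optim.SGD(emb.parameters(), lr=0.025, momentum=0.9)

# Fast: plain SGD applies the sparse gradient rows only.
opt = torch.optim.SGD(emb.parameters(), lr=0.025)

idx = torch.randint(0, 100_000, (10,))
emb(idx).sum().backward()
opt.step()   # updates ~10 rows, not 100,000
```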

Hi, Adoni.
I tested your code and found the speed can be improved further (by nearly 10x). I wrote a new version based on your code and posted it on GitHub: https://github.com/fanglanting/skip-gram-pytorch. Any suggestions are welcome.



Thanks so much for your code. I'm sorry I didn't check my account and replied so late. Could you tell me which technique you used to accelerate it? I find I need more time to understand your code.

Good luck.

I find the most important differences may be:

pos_u, pos_v, neg_v = self.op.generate_batch(self.windows_size, self.batch_size, self.neg_sample_num)


neg_score = torch.bmm(neg_embed_v, embed_u.unsqueeze(2)).squeeze()


Hi lanting,
I figured it out, and yes, it accelerates the running speed, but I only got a 2x speedup. I will do more to make it faster.
Thanks a lot!

Hi, Adoni,
The most important differences are:
neg_v = np.random.choice(self.sample_table, size=(batch_size*2*window_size, count))
neg_score = torch.bmm(neg_embed_v, embed_u.unsqueeze(2)).squeeze()
Sampling all negative samples at once is much faster than sampling them one by one. Meanwhile, torch.bmm() accelerates the computation compared with torch.mul().
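Putting the two quoted lines in context, here is a runnable sketch (shapes, the stand-in sample table, and variable names beyond the quoted ones are illustrative assumptions, not the repo's exact code):

```python
import numpy as np
import torch
import torch.nn as nn

batch_size, window_size, count, dim, vocab = 4, 2, 5, 20, 1000
emb_u = nn.Embedding(vocab, dim)
emb_v = nn.Embedding(vocab, dim)

# Sample ALL negatives for the whole batch in one call,
# instead of calling the sampler once per positive pair.
sample_table = np.arange(vocab)  # stand-in for the unigram noise table
neg_v = np.random.choice(sample_table, size=(batch_size * 2 * window_size, count))

pos_u = torch.randint(0, vocab, (batch_size * 2 * window_size,))
embed_u = emb_u(pos_u)                        # (B, dim)
neg_embed_v = emb_v(torch.from_numpy(neg_v))  # (B, count, dim)

# One batched matmul replaces `count` separate torch.mul + sum calls.
neg_score = torch.bmm(neg_embed_v, embed_u.unsqueeze(2)).squeeze()
print(neg_score.shape)                        # torch.Size([16, 5])
```

Each row of neg_score holds the dot products of one center embedding with all of its negative-sample embeddings.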

But I think the reason torch.bmm() is faster than torch.mul() is that we save the time of looking up the embeddings of neg_v, because earlier I found the bottleneck was the lookup operation.