[RESOLVED] Problem with skip gram model

Hi everyone,
I start using pytorch today. However, I find it’s too slow in the skip-gram model I implement. I compare it with the same program implemented with DyNet. The result shows DyNet is 1000 times faster. I don’t believe that!
So I want to ask if anyone has implemented skip-gram model that is as fast as the existing tools (e.g. word2vec).

Any suggestions are welcome!


I solved this problem. It was caused by using the momentum parameter, which forces the optimizer (SGD) to perform dense updates instead of sparse updates.

The code is here

The key to getting speed in these cases is to use sparse gradients, for example nn.Embedding(..., sparse=True). Otherwise, a full dense gradient is computed in each backward pass, which slows things down significantly.
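To make the difference concrete, here is a minimal sketch (vocabulary size and indices are illustrative): with sparse=True, backward() produces a sparse gradient containing only the rows that were actually looked up, rather than a full vocab_size x dim matrix.

```python
import torch
import torch.nn as nn

vocab_size, dim = 100_000, 128
emb = nn.Embedding(vocab_size, dim, sparse=True)

idx = torch.tensor([3, 17, 42])      # only three rows are touched
loss = emb(idx).sum()
loss.backward()

print(emb.weight.grad.is_sparse)     # True
print(emb.weight.grad._nnz())        # 3 non-zero rows, not 100,000
```

Note that only some optimizers (e.g. SGD without momentum, or SparseAdam) can consume sparse gradients.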

Thanks for your reply!
However, even though I set sparse=True, there is still a large gap between the speed of the original word2vec and my PyTorch program (30,000 pairs/s vs. 100 pairs/s). I'm wondering if there are other common mistakes that cause slow speed. Of course, I will continue looking for bugs in my program, and I will share the mistake if I find it.

Also, I find that the meaning of the "sparse" parameter is not covered in the official documentation. Maybe someone should update it.

I'm not entirely surprised; word2vec is a specialized code base for one particular model.
But it would be interesting to know why the PyTorch program is this slow. It could be either a user mistake or something slow in our core library. If it is slowness in the core library, I'm happy to speed it up.

I have implemented word2vec in pure Python and found it was not this slow, so the problem may be caused by a mistake of mine.

I suggest posting your code, in a GitHub gist for example, or profiling it yourself.


Hi, I'm sorry to reply so late.
I posted my program on GitHub: https://github.com/Adoni/naive_network_embedding.
Usage and known problems are described in the README. Any suggestions are welcome.

Hi, I posted my program here: https://github.com/Adoni/naive_network_embedding. I find that the speed drops when I call the backward() and step() functions. I have been debugging for several days but cannot find any more bugs.

Hi @ehsanmok, I have posted my code on GitHub; could you please help me check it?

There are a number of potential things that make your code very slow when you have big graphs:

  1. Your graph object is very big and you are storing the nodes and edges in lists!
  2. You are not using batched computation.
  3. negative_sampling probes a list, which is also slow for huge lists.

line_profiler will point you to where the issues are.
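On point 2 (batched computation): looking up one (center, context) pair at a time builds a tiny graph per pair, while a single batched lookup amortizes the Python and kernel-launch overhead. A rough sketch (names and sizes are illustrative, not taken from the linked repo):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10_000, 64, sparse=True)
pairs = [(1, 2), (5, 9), (7, 3)]

# Slow: one tiny forward per pair, driven from a Python loop.
scores_slow = [(emb(torch.tensor([u])) * emb(torch.tensor([v]))).sum()
               for u, v in pairs]

# Fast: one batched lookup and one vectorized reduction.
u = torch.tensor([p[0] for p in pairs])
v = torch.tensor([p[1] for p in pairs])
scores_fast = (emb(u) * emb(v)).sum(dim=1)

assert torch.allclose(torch.stack(scores_slow), scores_fast)
```

The two versions compute identical scores; only the number of Python-level calls differs.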

Thanks for your reply. I have updated my code to use batching. forward() and backward() became faster; however, optimizer.step() is still slow. :disappointed:

Hi guys, I have solved this problem. It was caused by using the momentum parameter, which forces the optimizer (SGD) to perform dense updates instead of sparse updates.
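For readers hitting the same wall, a minimal sketch of the fix (sizes are illustrative): with momentum, SGD keeps a dense per-parameter buffer, so each step touches every embedding row; plain SGD with a sparse gradient only writes the rows present in that gradient.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(100_000, 50, sparse=True)

# Slow: the momentum buffer is a full (vocab x dim) tensor, so every
# step does dense work even if the batch used only a few rows.
# opt = torch.optim.SGD(emb.parameters(), lr=0.025, momentum=0.9)

# Fast: plain SGD applies the sparse gradient rows only.
opt = torch.optim.SGD(emb.parameters(), lr=0.025)

idx = torch.randint(0, 100_000, (10,))
emb(idx).sum().backward()
opt.step()   # updates ~10 rows, not 100,000
```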

Hi, Adoni.
I tested your code and found the speed can be improved further (by nearly 10x). I wrote a new version based on your code and posted it on GitHub: https://github.com/fanglanting/skip-gram-pytorch. Any suggestions are welcome.



Thanks so much for your code. I'm sorry I didn't check my account and replied so late. Could you tell me which technique you used to accelerate it? I find I need more time to understand your code.

Good luck.

I find the most important differences may be:

pos_u, pos_v, neg_v = self.op.generate_batch(self.windows_size, self.batch_size, self.neg_sample_num)


neg_score = torch.bmm(neg_embed_v, embed_u.unsqueeze(2)).squeeze()


Hi lanting,
I figured it out, and yes, it accelerates the running speed, but I only got a 2x speedup. I will do more to make it faster.
Thanks a lot!

Hi, Adoni,
The most important differences are:
neg_v = np.random.choice(self.sample_table, size=(batch_size*2*window_size, count))
neg_score = torch.bmm(neg_embed_v, embed_u.unsqueeze(2)).squeeze()
Sampling all negative samples at once is much faster than sampling them one by one. Meanwhile, torch.bmm() accelerates the computation compared with torch.mul().
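Putting the two quoted lines in context, here is a runnable sketch (shapes, the stand-in sample table, and variable names beyond the quoted ones are illustrative assumptions, not the repo's exact code):

```python
import numpy as np
import torch
import torch.nn as nn

batch_size, window_size, count, dim, vocab = 4, 2, 5, 20, 1000
emb_u = nn.Embedding(vocab, dim)
emb_v = nn.Embedding(vocab, dim)

# Sample ALL negatives for the whole batch in one call,
# instead of calling the sampler once per positive pair.
sample_table = np.arange(vocab)  # stand-in for the unigram noise table
neg_v = np.random.choice(sample_table, size=(batch_size * 2 * window_size, count))

pos_u = torch.randint(0, vocab, (batch_size * 2 * window_size,))
embed_u = emb_u(pos_u)                        # (B, dim)
neg_embed_v = emb_v(torch.from_numpy(neg_v))  # (B, count, dim)

# One batched matmul replaces `count` separate torch.mul + sum calls.
neg_score = torch.bmm(neg_embed_v, embed_u.unsqueeze(2)).squeeze()
print(neg_score.shape)                        # torch.Size([16, 5])
```

Each row of neg_score holds the dot products of one center embedding with all of its negative-sample embeddings.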

But I think the reason torch.bmm() is faster than torch.mul() is that we save the time of looking up the embeddings of neg_v, because earlier I found the bottleneck was the lookup operation.