[RESOLVED] Problem with skip-gram model

Hi everyone,
I started using PyTorch today. However, I find the skip-gram model I implemented is too slow. I compared it with the same program implemented in DyNet, and the result shows DyNet is 1000 times faster. I can hardly believe that!
So I want to ask whether anyone has implemented a skip-gram model that is as fast as the existing tools (e.g. word2vec).

Any suggestions are welcome!

=============SOLUTION===============

I solved this problem. It was caused by using the momentum parameter, which makes the optimizer (SGD) perform a dense update instead of a sparse one.

The code is here

The key to getting speed in these cases is to use sparse gradients, for example nn.Embedding(..., sparse=True). Otherwise, a full dense gradient is computed in each backward pass, which slows it down significantly.
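Below is a minimal sketch of what the sparse=True setting does (the sizes and the idx tensor are illustrative, not the original code): the embedding's gradient stays a sparse tensor that only covers the rows actually looked up, instead of a dense vocab_size x dim matrix.

```python
import torch
import torch.nn as nn

vocab_size, dim = 100000, 100
emb = nn.Embedding(vocab_size, dim, sparse=True)  # sparse gradients for the weight

idx = torch.randint(0, vocab_size, (128,))  # a batch of word indices
loss = emb(idx).sum()                       # dummy loss, just to trigger backward
loss.backward()

print(emb.weight.grad.is_sparse)  # True: the gradient only touches the looked-up rows
```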

Thanks for your reply!
However, even though I set sparse=True, there is still a large gap between the speed of the original word2vec and my PyTorch program (30,000 pairs/s vs. 100 pairs/s). I'm wondering whether there are other common mistakes that cause such slow speed. Of course, I will continue looking for bugs in my program, and I will share the mistake if I find it.

Also, I find that the meaning of the "sparse" parameter is not explained in the official documentation. Maybe someone should update it.

I'm not entirely surprised; word2vec is a specialized code base for a particular model.
But it'd be interesting to know why the PyTorch program is this slow. It could either be a user mistake, or something slow in our core library. If it is slowness in the core library, I'm happy to speed it up.

I have implemented word2vec in pure Python and found it's not this slow, so I think the problem may be caused by my own mistake.

I suggest posting your code in a GitHub gist, for example, or profiling it yourself.


Hi, I'm sorry for the late reply.
I have posted my program on GitHub: https://github.com/Adoni/naive_network_embedding.
Usage and known problems are described in the README. Any suggestion is welcome.

Hi, I have posted my program here: https://github.com/Adoni/naive_network_embedding. I find that the speed slows down when I use the backward() and step() functions. I have been debugging for several days but cannot find any more bugs.

Hi @ehsanmok, I have posted my code on GitHub. Could you please help me check it?

There are a number of potential things that can make your code very slow when you have big graphs:

  1. Your graph object is very big and you're storing the nodes and edges in lists.
  2. You are not using batched computation.
  3. negative_sampling is probing a list, which is also slow for huge lists.

line_profiler will point you to where the issues are.
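In case it helps, here is a minimal, self-contained sketch of using line_profiler to get per-line timings; train_step, emb, and idx are hypothetical stand-ins for your own training code, not names from the repository.

```python
# pip install line_profiler
from line_profiler import LineProfiler
import torch
import torch.nn as nn

def train_step(emb, idx):
    vec = emb(idx)            # embedding lookup
    loss = vec.pow(2).sum()   # dummy loss, just for the sketch
    loss.backward()

emb = nn.Embedding(10000, 64, sparse=True)
idx = torch.randint(0, 10000, (128,))

lp = LineProfiler()
lp.add_function(train_step)        # profile this function line by line
lp.runcall(train_step, emb, idx)   # run it under the profiler
lp.print_stats()                   # shows how much time each line takes
```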

Thanks for your reply. I have updated my code to use batches, and forward() and backward() have become faster. However, optimizer.step() is still slow. :disappointed:

Hi guys, I have solved this problem. It was caused by using the momentum parameter, which makes the optimizer (SGD) perform a dense update instead of a sparse one.
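For anyone hitting the same issue, here is a minimal sketch of the change described above (sizes and learning rate are illustrative, not the original code): plain SGD keeps the update sparse when the embedding uses sparse=True, while adding momentum made step() fall back to a dense update in my case.

```python
import torch
import torch.nn as nn
import torch.optim as optim

emb = nn.Embedding(100000, 100, sparse=True)

# slow in my case: optim.SGD(emb.parameters(), lr=0.025, momentum=0.9)
optimizer = optim.SGD(emb.parameters(), lr=0.025)  # no momentum -> sparse update

idx = torch.randint(0, 100000, (128,))
optimizer.zero_grad()
emb(idx).sum().backward()  # dummy loss, just to produce a sparse gradient
optimizer.step()           # only the looked-up rows are updated
```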

Hi Adoni,
I tested your code and found that the speed can be improved further (by nearly 10 times). I wrote a new version based on your code and posted it on GitHub: https://github.com/fanglanting/skip-gram-pytorch. Any suggestion is welcome.


Hi,

Thanks so much for your code. I'm sorry that I didn't check my account and replied so late. Could you tell me what techniques you used to accelerate it? I find it will take me some more time to understand your code.

Good luck.
Xiaofei

Hi,
I find the most important differences may be:

pos_u, pos_v, neg_v = self.op.generate_batch(self.windows_size, self.batch_size, self.neg_sample_num)

and

neg_score = torch.bmm(neg_embed_v, embed_u.unsqueeze(2)).squeeze()

right?

Hi lanting,
I figured it out, and yes, it does accelerate the running speed. But I only get a 2x speedup. I will do more to make it faster.
Thanks a lot!

Hi Adoni,
The most important differences are

neg_v = np.random.choice(self.sample_table, size=(batch_size * 2 * window_size, count))

and

neg_score = torch.bmm(neg_embed_v, embed_u.unsqueeze(2)).squeeze()

Sampling all negative samples at once is much faster than sampling them one by one. Meanwhile, torch.bmm() accelerates the running speed compared with torch.mul().
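To make the comparison concrete, here is a minimal, self-contained sketch of the batched version (vocab_size, dim, sample_table, and the batch shapes are illustrative, not taken from either repository): all negatives for the batch are drawn in one np.random.choice call, and all negative scores are computed with a single torch.bmm.

```python
import numpy as np
import torch
import torch.nn as nn

vocab_size, dim = 10000, 100
batch_size, window_size, neg_count = 64, 5, 5
pairs = batch_size * 2 * window_size

# in practice this would be the unigram^0.75 sampling table; random here for the sketch
sample_table = np.random.randint(0, vocab_size, size=100000)

emb_u = nn.Embedding(vocab_size, dim, sparse=True)   # center-word embeddings
emb_v = nn.Embedding(vocab_size, dim, sparse=True)   # context-word embeddings

pos_u = torch.randint(0, vocab_size, (pairs,))       # center words for the batch
# one vectorized draw for every negative sample in the batch
neg_v = torch.LongTensor(np.random.choice(sample_table, size=(pairs, neg_count)))

embed_u = emb_u(pos_u)        # (pairs, dim)
neg_embed_v = emb_v(neg_v)    # (pairs, neg_count, dim)

# batched dot products in one call: (pairs, neg_count, dim) @ (pairs, dim, 1)
neg_score = torch.bmm(neg_embed_v, embed_u.unsqueeze(2)).squeeze(2)  # (pairs, neg_count)
```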

Thanks!
But I think the reason torch.bmm() is faster than torch.mul() is that we save the time of looking up the embeddings of neg_v, because earlier I found that the bottleneck was the lookup operation.