PyTorch seems very slow on CPU

I implemented word2vec (skip-gram with negative sampling) using PyTorch, but it's running much slower than gensim's word2vec.
gensim took about 2 minutes to finish training, whereas the PyTorch version looks like it will take half a day.

PyTorch is set up on a MacBook Pro with CPU only, and I can see the CPU usage of the PyTorch word2vec sitting at 100% (so only 1 core is utilized).

gensim, on the other hand, uses multiple worker threads for training; it relies on numpy and Cython functions (which release the GIL).
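
For reference, my gensim training call looks roughly like this (a sketch; the toy corpus and hyperparameter values are placeholders, not my exact settings):

from gensim.models import Word2Vec

# toy corpus just to make the snippet runnable
corpus = [["the", "quick", "brown", "fox"], ["jumps", "over", "the", "lazy", "dog"]]

# skip-gram (sg=1) with negative sampling; `workers` controls the training threads
model = Word2Vec(sentences=corpus, sg=1, negative=5, workers=4, min_count=1)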

So I wonder: how could I optimize the PyTorch implementation further to match the performance of the gensim version?

Hi,

How did you install PyTorch? If you installed the provided binaries and didn't force it to run single-threaded with torch.set_num_threads() or the OMP_NUM_THREADS env variable, then it should use all available cores.
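
For example (a minimal sketch; the thread count of 4 is just an illustration):

import torch

print(torch.get_num_threads())  # intra-op threads PyTorch will use
torch.set_num_threads(4)        # explicitly cap the thread count
print(torch.get_num_threads())

# or set the env variable before starting Python, e.g.:
#   OMP_NUM_THREADS=4 python train.py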

Hey @albanD, thanks for your quick reply.

I installed it via conda install pytorch torchvision -c pytorch.

And I didn't set the threads via torch.set_num_threads() or OMP_NUM_THREADS, so it should be at the default setting.

BTW, set_num_threads() doesn't work for me at all; it's always 1:

In [2]: th.get_num_threads()
Out[2]: 1

In [3]: th.set_num_threads(2)

In [4]: th.get_num_threads()
Out[4]: 1

That would mean that your version of torch is compiled without multithreading support.
Which OS are you using? Do you see the same behavior if you install from pip?
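
One way to inspect what your binary was built with (this prints the threading backend and the OpenMP/MKL status):

import torch

print(torch.__config__.parallel_info())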

Hey @albanD
macOS Catalina, version 10.15.6.

Yes, same behavior with pip. I thought conda would be different, but it turns out to be the same.

So how could I get multi-threading support?

Hi @ptrblck, could you please tell me how to get multi-threading support for PyTorch?

I've searched for a while, and it looks like I need to build and install from source manually?

Hi,

Sorry it took a bit of time to track this down.
We've always had that limitation for the Mac binaries.
There has been some work towards fixing this, so you might want to try the nightly builds: https://github.com/pytorch/pytorch/issues/43036

I forgot to leave a few notes here, just for the reference of other people who might have the same doubts.

On macOS, here is the PyTorch config:

In [1]: print(th.__config__.parallel_info())
ATen/Parallel:
	at::get_num_threads() : 8
	at::get_num_interop_threads() : 8
OpenMP not found
Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
	mkl_get_max_threads() : 1
Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
std::thread::hardware_concurrency() : 16
Environment variables:
	OMP_NUM_THREADS : [not set]
	MKL_NUM_THREADS : [not set]
ATen parallel backend: native thread pool

OpenMP is not found in the macOS PyTorch build, and by default get_num_threads() is 8.
Since macOS PyTorch uses MKL, setting MKL_NUM_THREADS=16 changes get_num_threads() as well.

So, by setting MKL_NUM_THREADS, more CPU cores can be utilized.
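
A minimal sketch of that workaround (assuming the env variable must be set before torch is imported, since the thread pools are sized at initialization):

import os

# must happen before `import torch`; MKL reads this at load time
os.environ["MKL_NUM_THREADS"] = "16"

import torch

print(torch.__config__.parallel_info())  # mkl_get_max_threads() should now report 16
print(torch.get_num_threads())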

Did you find any solution to your problem? In my case, the PyTorch implementation of word2vec training takes at least 5 hours, whereas gensim's word2vec training takes only a couple of seconds on the Brown dataset. I'm new to PyTorch, but I found this much of a difference quite surprising. I searched a lot but could not find a satisfactory answer yet.