I find that matrix multiplication is slower through the C++ API, so I wrote the same code in C++ and Python and recorded their execution times. The code is as follows:
The execution time here might depend on the current state of your CPU. In any case, I don’t think the C++ API is slower than the Python one for this operation (which is the subject of this post). I’d guess, like you, that the execution times should be pretty close; I don’t see any significant Python overhead here.
The execution time of a model on the CPU is subject to the CPU load at that moment (e.g. if there are background tasks running on the OS, execution takes longer).
If you run setup.py to install PyTorch, then libtorch will also be built. Here is a little guide on how to build libtorch in a clean Anaconda environment. Using it, I also see that the Python API is ~2x faster. Although the gap is not as significant, I also wonder where that speed-up comes from, since this is the official way to build PyTorch:
Python operation time (s): 0.0018
C++ operation time (s): 0.00416361
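For comparison, a timing loop of this kind might look like the C++ sketch below; the matrix size, warm-up count, and iteration count are assumptions of mine, not values taken from this thread.

```cpp
#include <torch/torch.h>

#include <chrono>
#include <iostream>

int main() {
  // Assumed problem size; the posts above do not state the actual one.
  torch::Tensor a = torch::rand({1000, 1000});
  torch::Tensor b = torch::rand({1000, 1000});

  // Warm-up runs so that one-time initialization is not measured.
  for (int i = 0; i < 10; ++i) {
    torch::mm(a, b);
  }

  // Time many calls and report the mean, which is more stable than a
  // single measurement when the CPU load fluctuates.
  constexpr int kIters = 100;
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; ++i) {
    torch::mm(a, b);
  }
  const auto end = std::chrono::steady_clock::now();
  const double total_s = std::chrono::duration<double>(end - start).count();
  std::cout << "mean per call: " << total_s / kIters << " s\n";
}
```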
I followed your guide and hit the following error while running make:
undefined reference to symbol 'omp_get_num_threads@@OMP_1.0'
//home/allen/miniconda3/envs/pytorch/lib/libgomp.so.1: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
I also ran into your problem. Using your C++ code, the first two runs take 0.02 s each; after that it drops to 0.003 s.
Also, the reason I found this thread is that I load a TorchScript model with torch::jit::load(), and I find that forward() sometimes takes 300 us and sometimes 30000 us or more. However, I did not build from source.
Environment:
- libtorch 1.2
- i7-8750H
In my code I only run that first inference and then get the network output from the exe. Is there a way to make the first inference faster when the exe is run multiple times?
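Within a single process, the usual approach is to run a few warm-up forwards before the inference you care about, since the first calls pay one-time costs (memory allocator growth, kernel selection, TorchScript optimization passes). A minimal sketch, assuming a TorchScript file model.pt and a 1x3x224x224 input (both placeholders of mine):

```cpp
#include <torch/script.h>
#include <torch/torch.h>

#include <iostream>
#include <vector>

int main() {
  // "model.pt" and the input shape are placeholders, not from this thread.
  torch::jit::script::Module module = torch::jit::load("model.pt");

  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::ones({1, 3, 224, 224}));

  torch::NoGradGuard no_grad;  // inference only, no autograd bookkeeping

  // Warm-up forwards absorb the one-time startup cost.
  for (int i = 0; i < 5; ++i) {
    module.forward(inputs);
  }

  // This call should now run at steady-state speed.
  at::Tensor output = module.forward(inputs).toTensor();
  std::cout << output.sizes() << "\n";
}
```

Note that this only helps within one process: if the exe is launched fresh for every inference, each launch pays the initialization again, so the common workaround is to keep one long-lived process and feed it multiple requests.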
A side note: when you benchmark libtorch against Python, please use a Release build instead of a Debug build, so that the program is compiled with optimizations (e.g. the -O3 flag).
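A quick way to confirm from inside the program which configuration you actually built (a sketch; CMake's Release configuration defines NDEBUG):

```cpp
#include <iostream>

int main() {
#ifdef NDEBUG
  std::cout << "optimized build (NDEBUG set): timings are meaningful\n";
#else
  std::cout << "debug build: benchmark numbers will be misleading\n";
#endif
}
```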