My libtorch binary built from source is slower than the official binary

I find that matrix multiplication is slower in the C++ API, so I wrote the same code in C++ and Python and recorded their execution times. The code is as follows:

C++:

#include <torch/torch.h>
#include <iostream>
#include <chrono>

int main() {
    torch::Tensor tensor = torch::randn({2708, 1433});
    torch::Tensor weight = torch::randn({1433, 16});
    auto start = std::chrono::high_resolution_clock::now();
    tensor.mm(weight);
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "C++ Operation Time(s) " << std::chrono::duration<double>(end - start).count() << "s" << std::endl;
    return 0;
}

Result:

C++ Operation Time(s) 0.082496s

Python:

import time

import torch

tensor = torch.randn(2708, 1433)
weight = torch.randn(1433, 16)
t0 = time.time()
tensor.mm(weight)
t1 = time.time()
print("Python Operation Time(s) {:.4f}".format(t1 - t0))

Result:

Python Operation Time(s) 0.0114

Testing Environment:

Ubuntu 16.04
GCC 5.4.0
Python 3.7.3
PyTorch 1.0.1

I don't think this is a small difference. Why does it happen?


I’m not getting the same results. By running multiple times I get:

C++ Operation Time(s) 0.00395173s
Python Operation Time(s) 0.0882

I’m using the version 1.1.0.dev20190506 of libtorch.

I think the execution times of the two should be similar. I also find that the C++ execution time is not stable: sometimes 0.02xx, sometimes 0.003xx…

The execution time here might depend on the current state of your CPU. In any case, I don't think the C++ API is slower than the Python one for this operation (which is the subject of this post). Like you, I'd guess that the execution times should be pretty close; I don't see any significant Python overhead here.

The execution time of CPU models is subject to the CPU load at that moment (e.g. if there are background tasks running on the OS, the execution time is longer).

Also, we recommend setting a few OpenMP environment variables for optimal CPU performance: https://github.com/mingfeima/convnet-benchmarks/blob/master/pytorch/run.sh#L16-L25.
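
For illustration, here is a minimal sketch (not taken from the linked script) of pinning the intra-op thread count from inside a C++ program. It assumes at::set_num_threads() is available in your libtorch version, and the thread count of 4 is just a placeholder; the variables in the linked run.sh (e.g. OMP_NUM_THREADS, KMP_AFFINITY) are normally exported from the shell before launching the binary, which is the more reliable route.

#include <torch/torch.h>
#include <ATen/Parallel.h>
#include <cstdlib>
#include <iostream>

int main() {
    // Option 1: set OMP_NUM_THREADS from inside the process. This only takes
    // effect if the OpenMP runtime has not been initialized yet, so exporting
    // it in the shell before launching is usually safer.
    setenv("OMP_NUM_THREADS", "4", /*overwrite=*/1);

    // Option 2: ask ATen directly for a fixed intra-op thread count.
    at::set_num_threads(4);

    torch::Tensor a = torch::randn({2708, 1433});
    torch::Tensor b = torch::randn({1433, 16});
    std::cout << a.mm(b).sizes() << std::endl;  // expected: [2708, 16]
    return 0;
}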

Maybe you built your C++ program in Debug mode.

I finally found that the main reason is that I was using a libtorch I built from source myself, which is significantly slower than the official libtorch.

Is there an official guide about how to build libtorch from source?

If you run setup.py to install PyTorch, libtorch will also be built. Here is a little guide on how to build libtorch in a clean Anaconda environment. Using it, I also see that the Python API is ~2x faster. Although the difference is not as significant, I still wonder where that speedup comes from, since this is the official way to build PyTorch:

Python Operation Time(s) 0.0018
C++ Operation Time(s) 0.00416361

I followed your guide and ran into this error when running make:

undefined reference to symbol 'omp_get_num_threads@@OMP_1.0'
//home/allen/miniconda3/envs/pytorch/lib/libgomp.so.1: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status

Any advice?

Maybe you are missing packages in your Miniconda environment.

Yes, I hit the same error. Have you solved it?

You are missing the warm-up part. Before timing the main loop, you should run a 5- or 10-iteration warm-up loop for the CPU/GPU, as shown in the sketch below.
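
For example, here is a minimal sketch of the C++ benchmark above with a warm-up loop and averaged timing (the iteration counts are arbitrary):

#include <torch/torch.h>
#include <chrono>
#include <iostream>

int main() {
    torch::Tensor tensor = torch::randn({2708, 1433});
    torch::Tensor weight = torch::randn({1433, 16});

    // Warm-up: run the op a few times so one-time initialization and cache
    // warm-up do not end up inside the timed region.
    for (int i = 0; i < 10; ++i) {
        tensor.mm(weight);
    }

    // Time the op over many iterations and report the average.
    const int iters = 100;
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i) {
        tensor.mm(weight);
    }
    auto end = std::chrono::high_resolution_clock::now();
    double total = std::chrono::duration<double>(end - start).count();
    std::cout << "C++ Operation Time(s) " << total / iters << "s" << std::endl;
    return 0;
}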


I also ran into this problem. I ran your C++ code: the first couple of runs took 0.02 s, then it dropped to 0.003 s.
The reason I found this thread is that I load a TorchScript model with torch::jit::load(), and forward() sometimes takes 300 µs and sometimes 30000 µs or more. However, I did not build from source.

Environment:
libtorch 1.2
i7-8750H

In my code, I only do a single inference and then read the network output from the exe. Is there a way to make that first inference faster when I run the exe multiple times?

Maybe this issue can help: https://github.com/pytorch/pytorch/issues/20156
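
As a workaround within a single run (a sketch, not taken from the linked issue; the model path and input shape are placeholders), you can run a few dummy forward passes right after torch::jit::load() so that the expensive first call happens before the latency-sensitive inference:

#include <torch/script.h>
#include <iostream>
#include <vector>

int main() {
    // Hypothetical model path and input shape; replace with your own.
    torch::jit::script::Module module = torch::jit::load("model.pt");

    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(torch::randn({1, 3, 224, 224}));

    // Dummy forward passes: pay the first-call cost here instead of in the
    // part of the program where latency matters.
    for (int i = 0; i < 3; ++i) {
        module.forward(inputs);
    }

    // Real inference after warm-up.
    auto output = module.forward(inputs).toTensor();
    std::cout << output.sizes() << std::endl;
    return 0;
}

This does not help across separate runs of the exe, since each new process pays the initialization cost again.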

A side note: when you benchmark libtorch against Python, please use a Release build instead of a Debug build, and compile the program with the -O3 optimization flag.