My libtorch binary built from source is slower than the official binary

I find that matrix multiplication is slower in the C++ API, so I wrote the same code in C++ and Python and recorded their execution times. The code is as follows:

C++:

#include <torch/torch.h>
#include <iostream>
#include <chrono>

int main() {
    torch::Tensor tensor = torch::randn({2708, 1433});
    torch::Tensor weight = torch::randn({1433, 16});
    auto start = std::chrono::high_resolution_clock::now();
    tensor.mm(weight);
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "C++ Operation Time(s) " << std::chrono::duration<double>(end - start).count() << "s" << std::endl;
    return 0;
}

Result:

C++ Operation Time(s) 0.082496s

Python:

import time

import torch

tensor = torch.randn(2708, 1433)
weight = torch.randn(1433, 16)
t0 = time.time()
tensor.mm(weight)
t1 = time.time()
print("Python Operation Time(s) {:.4f}".format(t1 - t0))

Result:

Python Operation Time(s) 0.0114

Testing Environment:

Ubuntu 16.04
GCC 5.4.0
Python 3.7.3
PyTorch 1.0.1

I don't think this is a small difference. Why does it happen?


I’m not getting the same results. By running multiple times I get:

C++ Operation Time(s) 0.00395173s
Python Operation Time(s) 0.0882

I’m using the version 1.1.0.dev20190506 of libtorch.

I think the execution times of the two should be similar. I also find that the C++ execution time is not stable: sometimes 0.02xx, sometimes 0.003xx…

The execution time here might depend on the current state of your CPU. In any case, I don't think the C++ API is slower than the Python one for this operation (which is the subject of this post). Like you, I'd guess that the execution times should be pretty close; I don't see any significant Python overhead here.

The execution time of CPU models is subject to the CPU load at that moment (e.g. if there are background tasks running on the OS, the execution time is longer).

Also, we recommend setting a few OpenMP environment variables for optimal CPU performance: https://github.com/mingfeima/convnet-benchmarks/blob/master/pytorch/run.sh#L16-L25.
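
For illustration, here is a minimal sketch (not taken from the linked script) of pinning the intra-op thread count from inside a C++ program. It assumes at::set_num_threads() is available in your libtorch version, and the thread count of 4 is just a placeholder; the variables in the linked run.sh (e.g. OMP_NUM_THREADS, KMP_AFFINITY) are normally exported from the shell before launching the binary, which is the more reliable route.

#include <torch/torch.h>
#include <ATen/Parallel.h>
#include <cstdlib>
#include <iostream>

int main() {
    // Option 1: set OMP_NUM_THREADS from inside the process. This only takes
    // effect if the OpenMP runtime has not been initialized yet, so exporting
    // it in the shell before launching is usually safer.
    setenv("OMP_NUM_THREADS", "4", /*overwrite=*/1);

    // Option 2: ask ATen directly for a fixed intra-op thread count.
    at::set_num_threads(4);

    torch::Tensor a = torch::randn({2708, 1433});
    torch::Tensor b = torch::randn({1433, 16});
    std::cout << a.mm(b).sizes() << std::endl;  // expected: [2708, 16]
    return 0;
}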

Maybe you built your C++ program in Debug mode.

I finally found that the main reason is that I was using a libtorch I built from source myself, which is significantly slower than the official libtorch.

Is there an official guide about how to build libtorch from source?

If you run setup.py to install PyTorch, libtorch will also be built. Here is a little guide on how to build libtorch in a clean Anaconda environment. Using it, I also see that the Python API is ~2x faster. Although the difference is not as significant, I still wonder where that speedup comes from, since this is the official way to build PyTorch:

Python Operation Time(s) 0.0018
C++ Operation Time(s) 0.00416361

I followed your guide and ran into this error when running make:

undefined reference to symbol 'omp_get_num_threads@@OMP_1.0'
//home/allen/miniconda3/envs/pytorch/lib/libgomp.so.1: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status

Any advice?

Maybe you are missing packages in your Miniconda environment.

Yes, I hit the same error. Have you solved it?

You are missing the warm-up part. Before timing the main loop, you should run a 5- or 10-iteration warm-up loop for the CPU/GPU, as shown in the sketch below.
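
For example, here is a minimal sketch of the C++ benchmark above with a warm-up loop and averaged timing (the iteration counts are arbitrary):

#include <torch/torch.h>
#include <chrono>
#include <iostream>

int main() {
    torch::Tensor tensor = torch::randn({2708, 1433});
    torch::Tensor weight = torch::randn({1433, 16});

    // Warm-up: run the op a few times so one-time initialization and cache
    // warm-up do not end up inside the timed region.
    for (int i = 0; i < 10; ++i) {
        tensor.mm(weight);
    }

    // Time the op over many iterations and report the average.
    const int iters = 100;
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i) {
        tensor.mm(weight);
    }
    auto end = std::chrono::high_resolution_clock::now();
    double total = std::chrono::duration<double>(end - start).count();
    std::cout << "C++ Operation Time(s) " << total / iters << "s" << std::endl;
    return 0;
}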


I also ran into this problem. I ran your C++ code: the first couple of runs took 0.02 s, then it dropped to 0.003 s.
The reason I found this thread is that I load a TorchScript model with torch::jit::load(), and forward() sometimes takes 300 µs and sometimes 30000 µs or more. However, I did not build from source.

Environment:
libtorch 1.2
i7-8750H

In my code, I only do a single inference and then read the network output from the exe. Is there a way to make that first inference faster when I run the exe multiple times?

Maybe this issue can help: https://github.com/pytorch/pytorch/issues/20156
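
As a workaround within a single run (a sketch, not taken from the linked issue; the model path and input shape are placeholders), you can run a few dummy forward passes right after torch::jit::load() so that the expensive first call happens before the latency-sensitive inference:

#include <torch/script.h>
#include <iostream>
#include <vector>

int main() {
    // Hypothetical model path and input shape; replace with your own.
    torch::jit::script::Module module = torch::jit::load("model.pt");

    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(torch::randn({1, 3, 224, 224}));

    // Dummy forward passes: pay the first-call cost here instead of in the
    // part of the program where latency matters.
    for (int i = 0; i < 3; ++i) {
        module.forward(inputs);
    }

    // Real inference after warm-up.
    auto output = module.forward(inputs).toTensor();
    std::cout << output.sizes() << std::endl;
    return 0;
}

This does not help across separate runs of the exe, since each new process pays the initialization cost again.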

A side note: when you benchmark libtorch against Python, please use a Release build instead of a Debug build, and compile the program with the -O3 optimization flag.