My Libtorch binary built from source is slower than official binary

(Yao Zihang) #1

I find that matrix multiplication is slower in the C++ API, so I wrote the same code in C++ and Python and recorded their execution times. The code is as follows:

C++:

#include<torch/torch.h>
#include<iostream>
#include <chrono>

int main(){
	torch::Tensor tensor = torch::randn({2708, 1433});
	torch::Tensor weight = torch::randn({1433, 16});
	auto start = std::chrono::high_resolution_clock::now();
	tensor.mm(weight);
	auto end = std::chrono::high_resolution_clock::now();
	std::cout<< "C++ Operation Time(s) " << std::chrono::duration<double>(end - start).count() << "s" << 	std::endl;
	return 0;
}

Result:

C++ Operation Time(s) 0.082496s

python:

import torch
import torch.nn as nn
import torch.nn.functional as F
import time

tensor = torch.randn(2708, 1433)
weight = torch.randn(1433, 16)
t0 = time.time()
tensor.mm(weight)
t1 = time.time()
print("Python Operation Time(s) {:.4f}".format(t1 - t0))

Result:

Python Operation Time(s) 0.0114

Testing Environment:

ubuntu 16.04
gcc version 5.4.0
python version 3.7.3
pytorch version 1.0.1

I think it’s not a small difference. Why does it happen?

#2

I’m not getting the same results. By running multiple times I get:

C++ Operation Time(s) 0.00395173s
Python Operation Time(s) 0.0882

I’m using version 1.1.0.dev20190506 of libtorch.

(Yao Zihang) #3

I think the execution times of these two should be similar. I also find that the C++ execution time is not stable: sometimes 0.02xx s, sometimes 0.003xx s…

#4

The execution time here might depend on the current state of your CPU. In any case, I don’t think the C++ API is slower than the Python one for this operation (which is the subject of this post). Like you, I’d guess that the execution times should be pretty close; I don’t see any significant Python overhead here.
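One way to reduce that noise is to warm up first and then average over many iterations instead of timing a single call. A minimal sketch (the warm-up and iteration counts are arbitrary choices, not from this thread):

#include <torch/torch.h>
#include <chrono>
#include <iostream>

int main() {
	torch::Tensor tensor = torch::randn({2708, 1433});
	torch::Tensor weight = torch::randn({1433, 16});

	// Warm-up: the first calls pay one-time costs (thread-pool spin-up,
	// lazy initialization) that would otherwise skew the measurement.
	for (int i = 0; i < 10; ++i) {
		tensor.mm(weight);
	}

	// Time many iterations and report the mean.
	const int iters = 1000;
	auto start = std::chrono::high_resolution_clock::now();
	for (int i = 0; i < iters; ++i) {
		tensor.mm(weight);
	}
	auto end = std::chrono::high_resolution_clock::now();
	double total = std::chrono::duration<double>(end - start).count();
	std::cout << "Mean mm time: " << total / iters << " s" << std::endl;
	return 0;
}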

(Will Feng) #5

The execution time of CPU models is subject to the CPU load at that moment (e.g. if there are background tasks running on the OS, the execution time is longer).

Also, we recommend setting a few OpenMP environment variables for optimal CPU performance: https://github.com/mingfeima/convnet-benchmarks/blob/master/pytorch/run.sh#L16-L25.
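For a quick experiment without touching the shell environment, the intra-op thread count can also be pinned from C++. A sketch, assuming at::set_num_threads and at::get_num_threads from ATen/Parallel.h are available in your libtorch version (the count of 4 is an arbitrary example):

#include <torch/torch.h>
#include <ATen/Parallel.h>
#include <iostream>

int main() {
	// Pin the intra-op thread count so runs are comparable across machines;
	// 4 is an arbitrary example, match it to your physical core count.
	at::set_num_threads(4);
	std::cout << "Using " << at::get_num_threads() << " threads" << std::endl;

	torch::Tensor tensor = torch::randn({2708, 1433});
	torch::Tensor weight = torch::randn({1433, 16});
	tensor.mm(weight);
	return 0;
}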

(Intel Novel) #6

Maybe you built your C++ code in Debug mode?
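One quick way to check is the NDEBUG macro: CMake defines it for Release (and RelWithDebInfo) builds but not for Debug builds, so a sketch like this gives a rough indicator:

#include <iostream>

int main() {
	// CMake's Release-style configurations pass -DNDEBUG to the compiler;
	// Debug does not, so the macro is a rough build-mode indicator.
#ifdef NDEBUG
	std::cout << "NDEBUG defined (likely a Release build)" << std::endl;
#else
	std::cout << "NDEBUG not defined (likely a Debug build)" << std::endl;
#endif
	return 0;
}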

(Yao Zihang) #7

Finally, I found that the main reason is that I was using a libtorch I built from source myself, which is significantly slower than the official libtorch.

Is there an official guide about how to build libtorch from source?

(Martin Huber) #8

If you run setup.py to install PyTorch, then libtorch will also be built. Here is a little guide on how to build libtorch in a clean Anaconda environment. Using it, I also see that the Python API is ~2x faster. Although the difference is not as significant, I also wonder where that speed-up comes from, since this is the official way to build PyTorch:

Python Operation Time(s) 0.0018
C++ Operation Time(s) 0.00416361

(Yao Zihang) #9

I followed your guide and encountered this error when running make:

undefined reference to symbol 'omp_get_num_threads@@OMP_1.0'
//home/allen/miniconda3/envs/pytorch/lib/libgomp.so.1: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status

Any advice?

(Martin Huber) #10

Maybe you are missing packages in your Miniconda environment.