[JIT] Inference time in C++ is the same as in Python

Hi, I have an issue deploying a YOLOv3 model so that it runs fast in C++. I am working with this implementation: https://github.com/eriklindernoren/PyTorch-YOLOv3
I used `torch.jit.trace` to save the model as follows:

    example_img = torch.rand(1, 3, 416, 416)
    with torch.jit.optimized_execution(True):
        traced_script_module = torch.jit.trace(model, example_img)
        # save the converted model
        traced_script_module.save("yolov3.pt")

I loaded the file in my C++ code using `torch::jit::load` as follows:

    torch::jit::script::Module module;
    module = torch::jit::load("/home/bahey/dev_ws/src/testorch/yolov3.pt");
    torch::NoGradGuard no_grad_guard;  // run inference without autograd bookkeeping
    at::init_num_threads();

I measured the inference time in Python and in C++ and got the same time in both. I was expecting C++ to be faster. My question is: how can I make the model’s inference faster in C++?
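
For reference, this is roughly how I time the forward pass on the C++ side (a minimal sketch; the file name and input shape are the ones from the tracing code above, and the few warm-up runs are there because the JIT does part of its optimization work during the first calls):

    #include <torch/script.h>
    #include <torch/torch.h>

    #include <chrono>
    #include <iostream>
    #include <vector>

    int main() {
      // Path to the traced model saved above (adjust as needed).
      torch::jit::script::Module module = torch::jit::load("yolov3.pt");
      module.eval();
      torch::NoGradGuard no_grad;

      std::vector<torch::jit::IValue> inputs;
      inputs.push_back(torch::rand({1, 3, 416, 416}));

      // A few warm-up runs: the JIT does part of its optimization work
      // during the first calls, so keep them out of the measurement.
      for (int i = 0; i < 3; ++i) {
        module.forward(inputs);
      }

      auto start = std::chrono::steady_clock::now();
      auto output = module.forward(inputs);
      auto end = std::chrono::steady_clock::now();
      std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
                << " ms" << std::endl;
      return 0;
    }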

# Environment

PyTorch version: 1.5
OS: Ubuntu 18.04 (Linux)
LibTorch version: nightly, cxx11 ABI
Python version: 3.6

Did you build libtorch from source? If so, I would make sure you included cuDNN (check that it matches your CUDA version) and MKL in your build. Using 1.3.0, I get a 5-6x speedup. I would imagine this would be similar in 1.5.0, if not better.

Hi @copythatpasta, thanks for your response.
I use the precompiled version of libtorch from https://pytorch.org/get-started/locally/
Also, I want to mention that I use the CPU for inference. I do get a 5-6x speedup when I try this with a simple model trained on MNIST data, but when I traced the YOLOv3 model and used it in C++ I got the same inference time.

I know that if you do not include the MKL headers and libraries, PyTorch builds against Eigen to interface with BLAS. Eigen is a lot slower, so I would try building libtorch from source with MKL in a conda environment to compare. I am not sure what the prebuilt libs use.
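
One way to check is to print the build configuration of the libtorch you are linking against from C++ (a rough sketch; `at::show_config()` is the same summary that `torch.__config__.show()` prints on the Python side):

    #include <torch/torch.h>
    #include <ATen/Version.h>  // at::show_config()

    #include <iostream>

    int main() {
      // Compile-time configuration of the linked libtorch:
      // BLAS backend, MKL/MKL-DNN, OpenMP, CUDA/cuDNN, compiler flags, ...
      std::cout << at::show_config() << std::endl;
      std::cout << std::boolalpha
                << "MKL available:     " << at::hasMKL() << "\n"
                << "MKL-DNN available: " << at::hasMKLDNN() << std::endl;
      return 0;
    }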

What you are seeing comes down to a few things:

  • Even before the JIT, and for “real world” models, the speedup of C++ over Python is smaller than people think, because the Python overhead is a fixed cost. For the LLTM tutorial example, I measured this to be ~10%. For convolutional models like ResNet, it should be even less.
  • One would expect moving from Python to C++ to matter most when you have Python in a “hot loop”. But then you also have Tensors in that hot loop, and their per-operation overhead is significant enough that the hot loop itself is usually the thing to fix.
  • The JIT removes the Python overhead (including the GIL) within the execution of the model, even when the model is called from Python. ONNXRuntime works similarly.

The other truth is that you can get further speedups by combining (“fusing”) certain operations to avoid writing intermediates to memory. The pointwise part of an LSTM cell is the classic example. For this you need a “holistic” view of the computation to be able to optimize across operators. This is something the JIT has been doing to some extent (mostly on CUDA), and it is the goal of ONNXRuntime/TVM and of current PyTorch JIT development. It is, however, again independent of whether you invoke the computation from C++ or from Python.
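
To make that concrete, here is what the pointwise part of an LSTM cell looks like when written as individual operations (shapes and names are only for illustration); without fusion, every line materializes an intermediate tensor that has to go through memory, while a fuser can do all of it in a single kernel:

    #include <torch/torch.h>

    #include <iostream>

    int main() {
      // The pointwise tail of an LSTM cell written as separate eager ops
      // (shapes are made up for the example).
      auto gates = torch::rand({8, 4 * 64});  // stand-in for the output of the matmuls
      auto c = torch::rand({8, 64});          // previous cell state
      auto chunks = gates.chunk(4, /*dim=*/1);

      auto i = torch::sigmoid(chunks[0]);     // each of these lines writes a full
      auto f = torch::sigmoid(chunks[1]);     // intermediate tensor to memory ...
      auto g = torch::tanh(chunks[2]);
      auto o = torch::sigmoid(chunks[3]);
      auto c_next = f * c + i * g;            // ... and so do the products and the sum
      auto h_next = o * torch::tanh(c_next);

      std::cout << h_next.sizes() << std::endl;  // [8, 64]
      return 0;
    }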

Best regards

Thomas


Hi,
Could you please provide the complete code for converting the model by trace? When I try this, I run into many problems. I am using this repo: https://github.com/eriklindernoren/PyTorch-YOLOv3.

Thanks so much!