C++ forward 3x slower than Python for traced model

I have a traced model whose torch::jit::script::Module::forward runtimes differ significantly between Python and C++.

For Python (running the same traced model), the time ranges from 0.4-0.6 seconds per forward pass.
For C++, the time ranges from 1.2-1.6 seconds, which is almost 3x slower.

The model is multi-task (multi-output) and fairly large. It is an encoder-decoder based on the UNet architecture for semantic segmentation, but it also has a classifier network (Conv + Linear layers) attached where the encoder ends.

I also tested this with my other models, e.g. an AlexNet-based image classifier, and there the Python and C++ runtimes are comparable (C++ is even faster).

I use PyTorch 1.1.0.

Important notes:

  1. I have already deactivated autograd with torch::NoGradGuard no_grad.
  2. I measure the time of the forward pass only, without any tensor manipulation (a minimal sketch of this setup follows below).
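
For reference, here is roughly the measurement setup described above. This is a hypothetical sketch, not my exact code: the model path and input shape are assumptions, and in PyTorch 1.1.0 torch::jit::load returns a std::shared_ptr, so member access would use -> instead of .:

```cpp
#include <torch/script.h>
#include <chrono>
#include <iostream>

int main() {
  // Hypothetical path to the traced model file.
  torch::jit::script::Module module = torch::jit::load("traced_model.pt");

  torch::NoGradGuard no_grad;  // deactivate autograd for inference

  // Hypothetical input shape for a UNet-style segmentation model.
  std::vector<torch::jit::IValue> inputs;
  inputs.emplace_back(torch::randn({1, 3, 512, 512}));

  // Time the forward pass only, with no tensor manipulation around it.
  auto start = std::chrono::steady_clock::now();
  auto output = module.forward(inputs);
  auto end = std::chrono::steady_clock::now();

  std::cout << "forward took "
            << std::chrono::duration<double>(end - start).count() << " s\n";
}
```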

Could this be an issue caused by the multi-task architecture?


Is this model running on CPU? If so, this might be related to an OpenMP configuration difference. Would you like to try out the suggestions in https://github.com/pytorch/pytorch/issues/20156?
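
For example, one way to rule out threading differences is to pin the intra-op thread count to the same value in both runtimes before benchmarking. A minimal sketch, assuming a libtorch build that ships ATen/Parallel.h:

```cpp
#include <ATen/Parallel.h>  // at::set_num_threads, at::get_num_threads
#include <iostream>

int main() {
  // Match whatever torch.get_num_threads() reports on the Python side,
  // or set OMP_NUM_THREADS identically before launching both programs.
  at::set_num_threads(6);
  std::cout << "intra-op threads: " << at::get_num_threads() << "\n";
}
```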

Could you share the related code?

Sorry for the late reply. I solved the problem by pruning my model.

@waleedfarrukhgini Do you mind sharing more on how you pruned your model? It would help other community members who face the same problem. Thanks!

I basically did a mixture of four things:

  1. Reduce the number of channels per convolutional layer. I kept the depth of the network as it is, since I thought going deep was the key here, but figured I could reduce the number of channels per layer and still get the same performance.
  2. Remove the cropping step when combining the decoder and encoder. I made sure the sizes matched instead.
  3. Apply global average pooling to the feature map for the classifier part (points 3 and 4 are sketched below).
  4. Reduce the number of linear layers in the classification part.

I am not sure how much each point affects the runtime, but with the thread count fixed at 1 I was able to reduce CPU inference time from 1.2-1.6 seconds to 0.2-0.3 seconds.
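
To illustrate points 3 and 4, here is a hypothetical sketch of the classifier-head change using the C++ frontend (the function name and shapes are assumptions, not my original code). Global average pooling collapses the spatial dimensions before the linear layer, which shrinks the head dramatically:

```cpp
#include <torch/torch.h>

// Hypothetical classifier head: instead of flattening a large H x W
// feature map into a stack of Linear layers, global-average-pool the
// encoder output to 1 x 1 first, so a single small Linear layer suffices.
torch::Tensor classify(const torch::Tensor& encoded,         // [N, C, H, W]
                       torch::nn::Linear head) {             // C -> num_classes
  auto pooled = torch::adaptive_avg_pool2d(encoded, {1, 1}); // [N, C, 1, 1]
  return head->forward(pooled.flatten(1));                   // [N, num_classes]
}
```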

@waleedfarrukhgini Thanks a lot for the suggestions, they are really useful. If I were to reproduce the original slowdown, do you recommend a minimal example that I can use? I would like to get to the bottom of this, because running a traced model in C++ should never be slower than in Python.

I posted my experience in https://github.com/pytorch/pytorch/issues/20156. It seems like the CUDA version of libtorch has something wrong with its CPU path?

I tested the code in this tutorial: https://pytorch.org/tutorials/advanced/cpp_export.html on my Ubuntu 18.04 machine with 6 physical CPU cores and an RTX 2080. It took 80 s to finish 1000 model.forward calls in the C++ code, but just 25 s in the Python code! I can't call at::init_num_threads, because the compiler complains: error: 'init_num_threads' is not a member of 'at'. The strangest thing, I think, is that when I run the C++ code, CPU usage is 1000%, but when I run the Python code it is only 600%!
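
As a side note, that compile error is often just a missing include: at::init_num_threads is declared in ATen/Parallel.h. A minimal sketch, assuming a libtorch version that ships this header:

```cpp
#include <ATen/Parallel.h>  // declares at::init_num_threads
#include <torch/script.h>

int main() {
  at::init_num_threads();  // initialize thread-local ATen/OpenMP state
  // ... load the traced module and benchmark model.forward as before ...
}
```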

I searched for a long time, but nothing worked.

But when I followed this tutorial: https://pytorch.org/cppdocs/installing.html and downloaded libtorch from https://download.pytorch.org/libtorch/nightly/cpu/libtorch-shared-with-deps-latest.zip, it worked! CPU usage is 1200% when I run the C++ code, and it is just as fast as the Python code: 24 s!

At first, I used this download from the PyTorch download page:
https://download.pytorch.org/libtorch/cu101/libtorch-cxx11-abi-shared-with-deps-1.4.0.zip
So maybe this download only works well for CUDA? I also tested model.forward on CUDA, in C++ and in Python, and everything went as expected: CPU usage is 100%, and it is very fast: 2.4 s for 1000 forwards in both C++ and Python, 10 times faster than the CPU version.
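
For reference, this is roughly how the CUDA test looks on the C++ side. This is a sketch based on the cpp_export tutorial; the file name and input shape come from its ResNet example, not necessarily from my model:

```cpp
#include <torch/script.h>
#include <vector>

int main() {
  // Load the traced model directly onto the GPU.
  torch::jit::script::Module module =
      torch::jit::load("traced_resnet_model.pt", torch::kCUDA);

  // Inputs must live on the same device as the module.
  std::vector<torch::jit::IValue> inputs;
  inputs.emplace_back(torch::ones({1, 3, 224, 224}, torch::kCUDA));

  for (int i = 0; i < 1000; ++i) {
    auto out = module.forward(inputs).toTensor();
  }
}
```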

So I think the question is: why is the CUDA version of libtorch "so strange" on CPU? By "strange" I mean that CPU usage is 1000% (not 600%, not 1200%) and it is 3 times slower than the Python version.

https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.4.0%2Bcpu.zip
This version behaves the same as the CUDA version.

https://download.pytorch.org/libtorch/nightly/cpu/libtorch-cxx11-abi-shared-with-deps-latest.zip, however, is OK.