C++ forward 3x slower than Python for a traced model

I have a traced model that shows a big difference in runtime for the torch::jit::script::Module forward function.

For Python (also using the traced model), the time ranges from 0.4-0.6 seconds per forward pass.
For C++, the time ranges from 1.2-1.6 seconds, which is almost 3x slower.

The model is multi-task (multi-output) and fairly large. It is an encoder-decoder based on the UNet architecture for semantic segmentation, but it also has a classifier network (Conv + Linear layers) attached after the encoder ends.

I also tested this with my other models, e.g. AlexNet-based image classification, and the Python and C++ runtimes are comparable there (C++ is faster).

I use PyTorch 1.1.0.

Important notes:

  1. I have already deactivated autograd with torch::NoGradGuard no_grad.
  2. I measure the time for the forward pass only, without any tensor manipulation (see the sketch after this list).
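
For reference, here is a minimal sketch of the measurement setup, assuming the PyTorch 1.1.0 C++ API (torch::jit::load returns a shared_ptr there; newer versions return the Module by value) and a placeholder model path and input shape:

```cpp
#include <torch/script.h>
#include <chrono>
#include <iostream>

int main() {
  // "traced_model.pt" and the input shape are placeholders for illustration.
  std::shared_ptr<torch::jit::script::Module> module =
      torch::jit::load("traced_model.pt");

  torch::NoGradGuard no_grad;  // deactivate autograd for inference

  std::vector<torch::jit::IValue> inputs;
  inputs.emplace_back(torch::ones({1, 3, 512, 512}));

  module->forward(inputs);  // warm-up pass, excluded from the measurement

  auto start = std::chrono::steady_clock::now();
  auto output = module->forward(inputs);
  auto end = std::chrono::steady_clock::now();

  std::cout << "forward took "
            << std::chrono::duration<double>(end - start).count()
            << " s" << std::endl;
}
```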

Could this issue be caused by the multi-task architecture?


Is this model running on CPU? If so, this might be related to an OpenMP configuration difference. Would you like to try the suggestions in https://github.com/pytorch/pytorch/issues/20156?
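
In case it helps, here is a minimal sketch of pinning the intra-op thread count on the C++ side, which is one way to rule out a thread configuration mismatch between the two runtimes (the choice of one thread is just an example):

```cpp
#include <ATen/Parallel.h>
#include <iostream>

int main() {
  // Pin intra-op parallelism so C++ and Python use the same configuration.
  // Equivalently, set OMP_NUM_THREADS=1 in the environment before launching.
  at::set_num_threads(1);
  std::cout << "intra-op threads: " << at::get_num_threads() << std::endl;
}
```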

Could you share the related code?

Sorry for the late reply. I solved the problem by pruning my model.

@waleedfarrukhgini Do you mind sharing more on how you pruned your model? It would help other community members who face the same problem. Thanks!

I basically did a mixture of four things:

  1. Reduce the number of channels per convolutional layer. I kept the depth of the network as it was, since I thought going deep was the key here, but I figured I could reduce the number of channels per layer and still get the same performance.
  2. Remove the cropping step when combining the decoder and encoder; I made sure the sizes matched instead.
  3. Apply global average pooling to the feature map for the classifier part (see the sketch after this list).
  4. Reduce the number of Linear layers in the classification part.
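
To illustrate point 3: global average pooling collapses each feature map to a single value per channel, so the classifier's first Linear layer needs only `channels` inputs instead of `channels * H * W`, which removes a large matrix multiplication. A minimal sketch in the LibTorch C++ frontend, with hypothetical layer sizes (not the actual model):

```cpp
#include <torch/torch.h>

// Hypothetical classifier head; the layer sizes are illustrative assumptions.
struct ClassifierHeadImpl : torch::nn::Module {
  torch::nn::Conv2d conv{nullptr};
  torch::nn::Linear fc{nullptr};

  ClassifierHeadImpl(int64_t in_channels, int64_t num_classes) {
    conv = register_module(
        "conv",
        torch::nn::Conv2d(torch::nn::Conv2dOptions(in_channels, 256, 3)));
    // Thanks to global average pooling in forward(), fc needs only 256
    // inputs, regardless of the spatial size of the feature map.
    fc = register_module("fc", torch::nn::Linear(256, num_classes));
  }

  torch::Tensor forward(torch::Tensor x) {
    x = torch::relu(conv->forward(x));
    // Global average pooling: collapse HxW to 1x1 per channel.
    x = torch::adaptive_avg_pool2d(x, {1, 1});
    return fc->forward(x.view({x.size(0), -1}));
  }
};
TORCH_MODULE(ClassifierHead);
```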

I am not sure how much each point contributes to the speed-up, but I was able to reduce inference time on CPU (fixed at 1 thread) from 1.2-1.6 seconds to 0.2-0.3 seconds.

@waleedfarrukhgini Thanks a lot for the suggestions; they are really useful. If I were to reproduce the original slowdown, do you recommend a minimal example that I can use? I would like to get to the bottom of this, because running a traced model in C++ should never be slower than in Python.