C++ forward 3x slower than Python for traced model

I have a traced model that shows a big runtime difference when calling the torch::jit::script::Module forward function.

In Python (with the same traced model), the time ranges from 0.4-0.6 seconds per forward pass.
In C++, the time ranges from 1.2-1.6 seconds, which is almost 3x slower.

The model is multi-task (multi-output) and fairly large. It is an encoder-decoder based on the UNet architecture for semantic segmentation, but it also has an image classifier network attached after the encoding layers.

I also tested this with my other models, e.g. an AlexNet-based image classifier, and the Python and C++ runtimes are comparable there (C++ is faster).

I use PyTorch 1.1.0.

Important notes:

  1. I have already disabled autograd with torch::NoGradGuard no_grad.
  2. I measure the time of the forward pass only, without any tensor manipulation.

Is this an issue maybe because of the multi-task architecture?


Is this model running on the CPU? If so, this might be related to an OpenMP configuration difference. Would you like to try the suggestions in https://github.com/pytorch/pytorch/issues/20156?
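One quick thing to check (a sketch of the kind of suggestion in that issue, assuming the binary is linked against OpenMP): pin the thread count via the standard environment variable before launching the C++ program, and compare against what the Python process uses.

```shell
# Set the OpenMP thread count for the C++ process; 4 is an example value,
# ideally match the physical core count or whatever Python reports.
export OMP_NUM_THREADS=4
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
# Then run your C++ binary in this same shell.
```

If Python and C++ end up with different thread counts (oversubscription or a single thread), CPU forward times can easily differ by the factor you are seeing.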

Could you share the related code?