I have a traced model whose forward pass (torch::jit::script::Module forward function) runs much slower in C++ than in Python.
In Python (also using the traced model), a forward pass takes 0.4-0.6 seconds.
In C++, it takes 1.2-1.6 seconds, which is almost 3x slower.
The model is multi-task (multi-output) and fairly large. It is an encoder-decoder based on the UNet architecture for semantic segmentation, but it also has a classifier network (Conv + Linear layers) attached after the encoder.
I also tested my other models, e.g. an AlexNet-based image classifier, and the Python and C++ runtimes are comparable there (C++ is faster).
I use PyTorch 1.1.0.
- I have already deactivated autograd by
- I measure the time of the forward pass only, without any tensor manipulation.
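
For reference, the kind of measurement described above can be sketched roughly as follows. This is only a minimal illustration using the C++ standard library, not the exact code from my project: `time_calls` and `median` are hypothetical helper names, and the callable passed in stands in for the real forward call (for example a lambda wrapping the module's forward function on a prepared input, which is an assumption about how it would be wired up). A few warm-up calls are discarded so that one-time setup costs do not distort the numbers.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: calls `fn` `warmup` times without timing it,
// then times `runs` further calls and returns each duration in seconds.
// In the real program, `fn` would wrap only the forward pass
// (an assumption; no tensor preparation should happen inside it).
template <typename Fn>
std::vector<double> time_calls(Fn&& fn, std::size_t warmup, std::size_t runs) {
    for (std::size_t i = 0; i < warmup; ++i) {
        fn();  // warm-up call, not measured
    }
    std::vector<double> seconds;
    seconds.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    return seconds;
}

// Hypothetical helper: median of a non-empty list of timings.
// Takes the vector by value because sorting modifies it.
double median(std::vector<double> v) {
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    const std::size_t mid = v.size() / 2;
    return (v.size() % 2 != 0) ? v[mid] : 0.5 * (v[mid - 1] + v[mid]);
}
```

Reporting the median over several runs, rather than a single call, makes the Python and C++ numbers easier to compare, since the first calls of a traced model may include one-time overhead.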
Could the multi-task architecture be the cause of this?