TorchScript model with C++ is much slower than PyTorch and ONNX

I tried three ways to run a torch.nn.GRU model on a CPU. The model is defined like this:
model = nn.GRU(512, 256, batch_first=True, bidirectional=True)
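For reference, a minimal sketch of how the TorchScript model for case 2 can be produced. The input shape `(1, 100, 512)` and the file name `gru.pt` are assumptions, not taken from the original post:

```python
import torch
import torch.nn as nn

model = nn.GRU(512, 256, batch_first=True, bidirectional=True)
model.eval()

# Trace with a representative input: (batch, seq_len, features) -- shape assumed
example = torch.randn(1, 100, 512)
with torch.no_grad():
    scripted = torch.jit.trace(model, example)
scripted.save("gru.pt")
```

The saved `gru.pt` can then be loaded on the C++ side with `torch::jit::load("gru.pt")`.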

  1. run with PyTorch; 2. convert to TorchScript and run with C++; 3. convert to ONNX and run with Python

Each test was run 100 times to get an average. The result is that TorchScript with C++ is much slower than the others: PyTorch and ONNX each take about 40 ms per run, but C++ takes about 120 ms!
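The timing loop for the PyTorch case looked roughly like the following (a sketch; the warm-up count and input shape are assumptions):

```python
import time
import torch
import torch.nn as nn

model = nn.GRU(512, 256, batch_first=True, bidirectional=True).eval()
x = torch.randn(1, 100, 512)  # assumed input shape

with torch.no_grad():
    for _ in range(10):        # warm-up iterations, excluded from timing
        model(x)
    n = 100
    start = time.perf_counter()
    for _ in range(n):
        model(x)
    avg_ms = (time.perf_counter() - start) / n * 1000

print(f"average latency: {avg_ms:.1f} ms")
```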

All three ways use the same MKL-DNN backend, so such a large performance gap seems unexpected. Does anyone know why, and how to improve the performance in C++?

Thanks very much!

Did you build PyTorch with MKL? I would suggest building with the MKL headers and shared objects (.so files) you can get in a conda environment. MKL is much faster than Eigen.
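A quick way to check whether an existing PyTorch build was compiled with MKL, and to inspect the full set of build flags, using standard torch APIs:

```python
import torch

print("MKL available:    ", torch.backends.mkl.is_available())
print("MKL-DNN available:", torch.backends.mkldnn.is_available())
print(torch.__config__.show())  # full build configuration, incl. BLAS backend
```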