I was able to significantly speed up our inference process on CPU by sending a batch of inputs through the model at once, rather than running inference one sample at a time. However, once the PyTorch code is embedded in C++, all of the timing improvements are lost, and batched inference is even slightly slower than inferring on one data point at a time.
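For context, here is a minimal sketch of the kind of batched call I'm making on the C++ side, assuming the model was exported as TorchScript and loaded with torch::jit::load; the model path, input shape, batch size, and thread count are placeholders, not my actual setup:

```cpp
#include <torch/script.h>
#include <iostream>

int main() {
    // Intra-op thread defaults can differ between the Python and C++
    // builds, so pinning the count explicitly rules that out as a variable.
    at::set_num_threads(4);  // placeholder thread count

    // Load the TorchScript module and put it in inference mode.
    torch::jit::script::Module module = torch::jit::load("model.pt");
    module.eval();

    torch::NoGradGuard no_grad;  // disable autograd bookkeeping

    // One batch of 64 inputs instead of 64 single-sample forward calls.
    torch::Tensor batch = torch::rand({64, 128});  // placeholder shape
    torch::Tensor out = module.forward({batch}).toTensor();

    std::cout << out.sizes() << std::endl;
    return 0;
}
```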
Multiprocessing also stopped working once the code was embedded in C++.
Any ideas about what might be holding up the inference in the compiled code?