Batch inference no faster in the compiled model

I was able to speed up our CPU inference significantly by sending a batch of inputs through the model at once, rather than running inference on one data point at a time. However, once the PyTorch code is embedded in C++, all of the timing improvement is lost; batch inference is even slightly slower than one-by-one inference. A sketch of how the two paths look on the C++ side is below.
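For context, this is roughly what the one-by-one and batched paths look like in libtorch, assuming the model is exported as a TorchScript module ("model.pt" is a placeholder path) and each sample is a flat float feature vector; the shapes and sizes here are illustrative, not our actual model:

```cpp
// Minimal sketch of one-by-one vs. batched inference with libtorch.
// "model.pt", kFeatures, and the sample count are placeholder assumptions.
#include <torch/script.h>
#include <vector>

int main() {
  torch::NoGradGuard no_grad;                       // inference only, no autograd
  auto module = torch::jit::load("model.pt");       // TorchScript module (placeholder path)
  module.eval();

  const int64_t kFeatures = 128;                    // assumed input width
  std::vector<torch::Tensor> samples;
  for (int i = 0; i < 64; ++i) {
    samples.push_back(torch::randn({kFeatures}));   // stand-in for real inputs
  }

  // One-by-one: one forward call per sample, each shaped [1, kFeatures].
  for (const auto& s : samples) {
    std::vector<torch::jit::IValue> inputs{s.unsqueeze(0)};
    auto out = module.forward(inputs).toTensor();
  }

  // Batched: stack everything into a single [N, kFeatures] tensor, one forward call.
  std::vector<torch::jit::IValue> batch_inputs{torch::stack(samples)};
  auto batch_out = module.forward(batch_inputs).toTensor();
}
```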

Multiprocessing also didn't work once the code was embedded in C++.

Any ideas what is holding up inference in the compiled code?

Solved: the discrepancy was due to our development environment running on torch while our delivery environment uses PyTorch. Although PyTorch is faster than torch overall, batch inference in PyTorch is no faster than one-by-one inference, whereas in torch batching gives a significant speedup.
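If anyone else runs into a similar mismatch between environments, a simple way to check whether a particular build actually benefits from batching is to time both paths directly where the model will be deployed. A rough sketch (the model path and input shape are placeholders, reusing the same hypothetical "model.pt" as above):

```cpp
// Rough timing harness: per-sample vs. batched inference in one environment.
// The model path and [N, features] shape are placeholder assumptions.
#include <torch/script.h>
#include <chrono>
#include <iostream>
#include <vector>

int main() {
  torch::NoGradGuard no_grad;
  auto module = torch::jit::load("model.pt");
  module.eval();

  auto batch = torch::randn({256, 128});            // placeholder [N, features] input

  auto t0 = std::chrono::steady_clock::now();
  for (int64_t i = 0; i < batch.size(0); ++i) {      // one forward call per sample
    std::vector<torch::jit::IValue> inputs{batch[i].unsqueeze(0)};
    module.forward(inputs);
  }
  auto t1 = std::chrono::steady_clock::now();
  std::vector<torch::jit::IValue> inputs{batch};     // single forward call for the whole batch
  module.forward(inputs);
  auto t2 = std::chrono::steady_clock::now();

  auto ms = [](auto a, auto b) {
    return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
  };
  std::cout << "one-by-one: " << ms(t0, t1) << " ms, batched: " << ms(t1, t2) << " ms\n";
}
```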