Slowdown when running multiple processes doing inference in parallel

Hi, I have a TorchScript transformer model and I'm trying to evaluate the performance of my C++ inference code. To this end, I'm running on a server with 32 CPUs (and doing inference on CPU). I have set OMP_NUM_THREADS=1 and MKL_NUM_THREADS=1, but I still notice that running 30 inference processes in parallel makes each forward step significantly slower (up to 2x) than running the inference jobs one at a time.
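
For reference, here is roughly what each worker process does (a minimal sketch; the model path and input shape are placeholders, not my real ones). Besides the environment variables, I also cap intra-op and inter-op parallelism from inside the process:

```cpp
#include <ATen/Parallel.h>
#include <torch/script.h>

#include <chrono>
#include <iostream>

int main() {
  // Mirror OMP_NUM_THREADS=1 / MKL_NUM_THREADS=1 inside the process:
  // one thread for intra-op work, one for the inter-op pool.
  at::set_num_threads(1);
  at::set_num_interop_threads(1);  // must be called before any inference

  // "model.pt" and the {1, 128} input shape are placeholders.
  torch::jit::script::Module module = torch::jit::load("model.pt");
  module.eval();

  torch::NoGradGuard no_grad;
  torch::Tensor input = torch::randn({1, 128});

  // Time a single forward step.
  auto start = std::chrono::steady_clock::now();
  auto output = module.forward({input});
  auto end = std::chrono::steady_clock::now();
  std::cout << "forward took "
            << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
            << " ms\n";
}
```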

Is there something else I’m missing?

@Prashant I'm running into the same issue. Did you solve it?