Hi, I have a TorchScript transformer model and I'm trying to evaluate the performance of my C++ inference code. To that end, I'm running on a server with 32 CPUs (and doing inference on CPU). I have set OMP_NUM_THREADS=1 and MKL_NUM_THREADS=1, but I still notice that running 30 inference processes in parallel makes each forward step significantly slower (up to 2x) than running the inference jobs one at a time.
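In case it helps, here is roughly how I launch the benchmark (a sketch, assuming Linux with `taskset` available; `./infer` and `model.pt` are placeholders for my actual binary and model, and I substitute a trivial `sleep` so the snippet runs as-is):

```shell
# Launch N single-threaded inference processes, one pinned per core,
# so the OS scheduler can't migrate them onto the same CPU.
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
NPROC=2   # I use 30 on the 32-core server; 2 here so the sketch runs anywhere
for i in $(seq 0 $((NPROC - 1))); do
  # taskset pins process i to core i; in the real run this line is
  #   taskset -c "$i" ./infer model.pt &
  taskset -c "$i" sleep 0.1 &
done
wait   # block until all pinned workers have exited
echo "launched $NPROC pinned workers"
```

Pinning with `taskset` is one thing I've tried; without it, the per-process timings were even noisier.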
Is there something else I’m missing?