Is it possible to speed up inference on a multi-core CPU machine with DDP?

I have a pre-trained transformer model (say LayoutLMv2). I am trying to build a real-time API where I have to run about 50 separate inferences on this model (50 images from a document).
I am trying to speed up the API without having to deploy it on a GPU.
Is it possible to parallelize this with DDP and get a better response time on a multi-core CPU machine?
Are there any practical examples of speeding up inference alone (for any torch model) on a CPU machine with DDP?

For a CPU machine, first check its existing utilization without DDP: inference is likely already parallelized across the batch dimension by the underlying kernels (e.g., OneDNN/MKL-DNN). If that is the case, I would not expect DDP to offer any speedup, since it would only add a layer of communication and indirection where none is needed. You can also tweak the amount of parallelism via e.g. OMP_NUM_THREADS; it is common practice to set this to the number of physical CPU cores (rather than the total logical core count, since vector units are typically shared across SMT lanes) to maximize throughput.
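Not an official recipe, just a minimal sketch of the above, assuming a hypothetical machine with 8 physical cores and a small stand-in model in place of LayoutLMv2:

```python
import torch

# Inspect the current parallel configuration (OpenMP/MKL thread counts, etc.)
# before changing anything.
print(torch.__config__.parallel_info())

# OMP_NUM_THREADS is normally exported in the shell *before* launching Python
# (e.g. `OMP_NUM_THREADS=8 python serve.py`); from inside a script,
# torch.set_num_threads controls the same intra-op thread pool.
# 8 is a hypothetical physical-core count -- check yours with `lscpu` or
# psutil.cpu_count(logical=False).
torch.set_num_threads(8)

# Stand-in model; substitute your pre-trained LayoutLMv2 in eval mode.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 768),
    torch.nn.ReLU(),
    torch.nn.Linear(768, 2),
).eval()

# Batch all 50 inputs into one forward pass so the OneDNN/MKL-DNN kernels can
# parallelize across the batch dimension, instead of 50 sequential calls.
inputs = torch.randn(50, 768)
with torch.inference_mode():
    outputs = model(inputs)

print(torch.get_num_threads(), outputs.shape)  # e.g. 8 torch.Size([50, 2])
```

If the cores are already saturated during a single batched forward pass, adding DDP processes on top would mostly just contend for the same cores.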
