PyTorch model inference performance in a multi-threaded application

I am using LibTorch (the C++ library) for an inference job. My overall workload consists of a number of feature chunks whose inference calls can potentially be parallelized. For example, let's say there are 100 feature chunks; then the code looks roughly like this:

for (auto &feature_chunk : feature_chunks) {
  auto out = model->forward(feature_chunk);
}

Please note this is an intuitive representation, not actual code. We are using 2 inter-op and 2 intra-op threads in our application. Running it this way underutilizes the hardware when the job runs on 16 logical cores (8 physical cores). To improve host utilization and get better throughput for this batch application, the following strategies were tried:

  1. Split the feature-chunk vector into 8 parts and start 8 such processes, each getting 1/8 of the feature chunks. Each of the 8 processes uses 2 inter-op and 2 intra-op threads. This setup maximizes our CPU usage, with each process using 200% CPU (2 logical cores), and gives the best performance among all our setups. The downside is that the model is loaded 8 times (once per process), so memory consumption is 8 times higher. Please note that we also load some additional artifacts that help with processing and that are larger than the PyTorch model used for inference, so loading them in 8 processes increases memory consumption considerably.

  2. To reduce the memory footprint, we ran the same application in a multi-threaded way so that all artifacts are loaded only once. There are 8 application threads calling LibTorch's model forward (a minimal sketch of this setup is at the end of this post). In this setup we found that all threads share LibTorch's global thread pool of 2 inter-op and 2 intra-op threads, which makes the performance very bad.

  3. As an improvement over Option 2, we allocated 16 intra-op and 8 inter-op threads in LibTorch's global thread pools along with the 8 application threads (see the sketch after this list). We found that this setup reaches 1000% CPU usage across all 16 logical cores, yet it still does not give the same performance as Option 1.

  4. We experimented with different combinations of application threads and LibTorch global threads, but could not beat the performance of Option 1.

  5. Profiling did not help us pinpoint the exact issue, but we know that LibTorch's global thread pool does not let us dedicate the required number of threads to each application thread. In the multi-processing case (Option 1) we were able to do exactly that, and hence got better performance.
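
As referenced from Option 3, here is roughly how we size LibTorch's global thread pools for these experiments. This is a minimal sketch using the at::set_num_threads / at::set_num_interop_threads calls from ATen/Parallel.h; the helper name and the exact numbers (the 16/8 split of Option 3) are only illustrative:

#include <ATen/Parallel.h>

// Called once at startup from the main thread, before any application
// thread touches the model.
void configure_torch_threads() {
  // Intra-op pool: threads used to parallelize work inside a single op.
  at::set_num_threads(16);
  // Inter-op pool: can only be set once, and only before any inter-op
  // parallel work has started; otherwise LibTorch raises an error.
  at::set_num_interop_threads(8);
}

With 2 and 2 this corresponds to the configuration mentioned at the top of the post; Option 4 simply varies these numbers together with the number of application threads.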

Following are the questions I would like to clarify:
  1. Does LibTorch work well with multi-threaded applications like the one described above, where we can scale the job (provided we have the CPU) linearly by adding more application threads that call model inference? If there is a way to achieve this, can you please point me to the right documentation/resources?

  2. Is there a way to disable the global thread pool and allocate Torch threads per application thread instead?

  3. When we profiled, we found that multiple application threads calling PyTorch model inference through Torch's global thread pool result in some sort of contention (spinlock/mutex, etc.). Is this understanding correct?

  4. What are the recommended practices for a multi-threaded application with LibTorch? Does LibTorch favor a multi-processing approach over a multi-threaded one?
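
For completeness, here is a minimal sketch of the multi-threaded setup from Options 2 and 3, under some assumptions: "model.pt", the 1x128 random tensor, and the chunk count of 100 are placeholders for our real TorchScript module and feature chunks. Each application thread takes every 8th chunk, which mirrors the 1/8 split per process in Option 1:

#include <torch/script.h>
#include <torch/torch.h>
#include <thread>
#include <vector>

int main() {
  // Load the TorchScript model once so that the model and the other large
  // artifacts are shared by all application threads.
  torch::jit::Module module = torch::jit::load("model.pt");
  module.eval();

  const int num_app_threads = 8;
  const int num_chunks = 100;

  std::vector<std::thread> workers;
  for (int t = 0; t < num_app_threads; ++t) {
    workers.emplace_back([&, t] {
      torch::NoGradGuard no_grad;  // inference only
      // Each application thread handles every num_app_threads-th chunk.
      for (int i = t; i < num_chunks; i += num_app_threads) {
        auto chunk = torch::randn({1, 128});  // placeholder feature chunk
        auto out = module.forward({chunk});   // result consumed by the real pipeline
      }
    });
  }
  for (auto &w : workers) {
    w.join();
  }
  return 0;
}

Whether calling forward on a single shared Module instance from many threads like this is the intended usage, or whether each thread should hold its own copy of the module, is part of what I would like to clarify.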