Inter- and intra-op difference

Hi fellow libtorchers! We are working on a project solely using the C++ API. Recently we started profiling our models running on CPU, as we want to get a feel for the parallelization on CPU before moving to GPU. It seems that libtorch provides two main parallelization options that the user can specify when computing on CPU: the intra-op thread count and the inter-op thread count. After some testing, we are still unsure where the difference between inter- and intra-op parallelism lies. As we monitor the time on our machine, we see a performance difference when specifying the intra-op thread count, while specifying the inter-op thread count seems to make no difference. So we are wondering what the difference between these two is, how one can utilize the inter-op thread pool, or if you have some nice resources on this topic! :slight_smile:

Some of the articles we looked at:

- CPU threading and TorchScript inference — PyTorch 1.9.0 documentation
- Intra- and inter-operator parallelism in PyTorch (work plan) · Issue #19002 · pytorch/pytorch · GitHub
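For reference, this is roughly what we call before running inference (a minimal sketch; the thread counts are just placeholders):

```cpp
#include <ATen/Parallel.h>

int main() {
  // Intra-op: threads used inside a single op, e.g. one large matmul.
  at::set_num_threads(8);

  // Inter-op: size of the shared pool for running independent tasks/ops in
  // parallel; has to be set before any inter-op work has started.
  at::set_num_interop_threads(8);

  // ... build the model and run forward passes here ...
}
```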

These docs might additionally be interesting; you could check the posted examples and compare them to your current workflow.
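For instance, the C++ part of the linked threading docs boils down to something like the following (a minimal sketch, with placeholder thread counts), which also lets you verify which settings and parallel backend are actually in effect:

```cpp
#include <ATen/Parallel.h>
#include <iostream>

int main() {
  // Initialize the thread-local threading state of the calling thread.
  at::init_num_threads();

  at::set_num_threads(4);          // intra-op pool size
  at::set_num_interop_threads(4);  // inter-op pool size

  // Prints the intra-/inter-op thread counts and the parallel backend
  // (e.g. OpenMP or native thread pool) used by this build.
  std::cout << at::get_parallel_info() << std::endl;
}
```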

Thanks for the reply, I work on a project together with Jim. I have read through the docs you linked, but I am not sure I understand how the inter-op threads are used. To make things clear: we are creating a model procedurally inside C++. Right now we are working on simple feed-forward nets, where we feed linear modules of varying sizes and activation functions into a Sequential, and use the forward method of the Sequential module to run inference. We are testing different shapes of the network and different sizes of the data. We can see a clear performance difference when setting different numbers of intra-op threads, but basically none with different inter-op threads.

The docs say “PyTorch uses a single thread pool for the inter-op parallelism, this thread pool is shared by all inference tasks that are forked within the application process.” The thing I struggle to understand is how one “forks” a new thread that uses the inter-op threads from libtorch (I can see that the PyTorch example uses a _fork() function from jit, but I can’t find anything similar in libtorch). Or does the forking happen somehow automatically inside libtorch code?

Another thing the docs say is “One or more inference threads execute a model’s forward pass on the given inputs. Each inference thread invokes a JIT interpreter that executes the ops of a model inline, one by one.” Again, sorry for my lack of understanding, but what does “one or more inference threads” mean here: do I need to fork the inference threads myself, or does libtorch do it for me? It also says that each inference thread invokes a JIT interpreter; from my understanding the JIT interpreter is used to run TorchScript models, but when I have a model inside C++ in the form of a Sequential module, is the JIT interpreter still being invoked?
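To make the setup concrete, it looks roughly like this (a minimal sketch; the layer sizes and batch size are placeholders):

```cpp
#include <torch/torch.h>

int main() {
  // Procedurally built feed-forward net: Linear modules and activations
  // pushed into a Sequential.
  torch::nn::Sequential net(
      torch::nn::Linear(256, 512),
      torch::nn::ReLU(),
      torch::nn::Linear(512, 512),
      torch::nn::ReLU(),
      torch::nn::Linear(512, 10));

  torch::NoGradGuard no_grad;          // inference only
  auto input = torch::randn({64, 256});
  auto output = net->forward(input);   // executes the layers one after another
}
```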

Hi @Bohaterowicz, here is some of my understanding:

  1. Forking will not happen automatically. You need to manually insert fork and wait, just as the documentation mentions.
  2. Libtorch already creates the inter-op thread pool based on num_interop_threads, but you need to invoke it manually (push your function into its task queue); see the sketch below.
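If I understand correctly, in C++ there is no direct counterpart of Python's torch.jit.fork, but you can push a task onto the inter-op pool with at::launch from ATen/Parallel.h and wait for the results yourself, e.g. via std::future. A minimal sketch (the model, thread counts, and shapes are just placeholders, not an official recipe):

```cpp
#include <ATen/Parallel.h>
#include <torch/torch.h>

#include <future>
#include <memory>
#include <vector>

int main() {
  // Size the inter-op pool; must happen before any inter-op work starts.
  at::set_num_interop_threads(4);

  torch::nn::Sequential net(
      torch::nn::Linear(128, 128),
      torch::nn::ReLU(),
      torch::nn::Linear(128, 10));

  // Submit several independent forward passes to the inter-op thread pool.
  std::vector<std::future<torch::Tensor>> results;
  for (int i = 0; i < 4; ++i) {
    auto promise = std::make_shared<std::promise<torch::Tensor>>();
    results.push_back(promise->get_future());
    at::launch([promise, &net]() {
      torch::NoGradGuard no_grad;  // thread-local, so set it inside the task
      promise->set_value(net->forward(torch::randn({32, 128})));
    });
  }

  // Block until every task has finished (the manual counterpart of "wait").
  for (auto& f : results) {
    torch::Tensor out = f.get();
  }
}
```

With only a single forward pass at a time there is nothing for the inter-op pool to run concurrently, which would explain why changing num_interop_threads makes no difference in your measurements.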

Does the number of interop threads have any effect during training?