I can parallelize computation over the batch dimension in my C++ extension using at::parallel_for; however, this seems to require compiling with OpenMP enabled (`-fopenmp` on GCC). I am reluctant to do this because the wheel will be used by many different users, some of whom might be running a version of PyTorch compiled against a different OpenMP library, or against TBB, which I worry might cause conflicts. (I had an issue like this in the past: PyTorch changed the OpenMP library used on Windows, and my Windows users started getting a warning about two different OpenMP libraries being loaded.)
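For context, the at::parallel_for version is roughly the following sketch (`process_one` is a stand-in for my actual per-sample kernel):

```cpp
#include <torch/extension.h>
#include <ATen/Parallel.h>

// Stand-in for the real per-sample kernel; here it just copies one batch element.
static void process_one(const at::Tensor& in, at::Tensor& out, int64_t b) {
  out[b].copy_(in[b]);
}

at::Tensor forward(const at::Tensor& input) {
  auto output = at::empty_like(input);
  const int64_t batch = input.size(0);
  // Split the batch across ATen's intra-op thread pool; this is the call that
  // seems to require building the extension with -fopenmp.
  at::parallel_for(0, batch, /*grain_size=*/1, [&](int64_t begin, int64_t end) {
    for (int64_t b = begin; b < end; ++b) {
      process_one(input, output, b);
    }
  });
  return output;
}
```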
Another option is to move the multithreading up into the call from Python, using multiple threads to call the C++ code for different ranges of the batch dimension. I don’t think PyTorch provides any facility for this, but I could probably use Python’s own ThreadPoolExecutor.
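If I went that route, I assume the binding would need to release the GIL while the C++ code runs so the worker threads can actually overlap. Something along these lines, where `forward_chunk` is a made-up name for a per-chunk entry point:

```cpp
#include <torch/extension.h>

// Hypothetical per-chunk entry point: processes batch elements [begin, end).
// Placeholder body; the real kernel would go here.
at::Tensor forward_chunk(const at::Tensor& input, int64_t begin, int64_t end) {
  return input.slice(/*dim=*/0, begin, end).clone();
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  // Release the GIL while the C++ code runs, so Python worker threads
  // (e.g. from a ThreadPoolExecutor) can execute different chunks concurrently.
  m.def("forward_chunk", &forward_chunk,
        py::call_guard<py::gil_scoped_release>());
}
```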
I would have imagined that this would be a common issue, so I am surprised that it is not discussed in the tutorials and that I haven’t found much about it elsewhere. Is ThreadPoolExecutor a good approach, or is there some other option, preferably a PyTorch facility, to enable portable multithreaded execution over the batch dimension in C++ extensions?