By default, PyTorch has a fixed-size OpenMP thread pool (with as many threads as there are physical cores on the processor). Most PyTorch users who run multiple PyTorch jobs concurrently on CPU probably use `taskset` and/or `numactl` to pin each job to particular CPU cores, in order to saturate all cores of their system. Doing so also avoids slowdowns from OpenMP synchronization for operations that should not be parallelized.
Consider a hypothetical scenario: say a user has assigned 8 cores to one of their PyTorch jobs.
If that job's workload also includes operations that run better on fewer cores than the 8 assigned, should the user programmatically scale the number of threads before and after such operations with `torch.set_num_threads()`? That is, would decreasing the number of OpenMP threads before such an operation and increasing it again afterwards be efficient enough to amortize the thread destruction & creation cost? If so, a variable-sized thread pool would automatically handle creating & destroying threads in the OpenMP thread pool.
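For concreteness, the pattern I have in mind looks roughly like this (a minimal sketch using the real `torch.get_num_threads()`/`torch.set_num_threads()` APIs; `limited_threads` is just a helper name I made up, not a PyTorch API):

```python
from contextlib import contextmanager

import torch


@contextmanager
def limited_threads(n: int):
    """Hypothetical helper: run the enclosed ops with `n` intra-op threads,
    then restore the previous thread count."""
    prev = torch.get_num_threads()
    torch.set_num_threads(n)
    try:
        yield
    finally:
        torch.set_num_threads(prev)


# Example: an op that may not benefit from all 8 assigned cores.
x = torch.randn(64, 64)
with limited_threads(1):
    y = x @ x  # runs with a single intra-op thread
# Outside the block, the previous thread count is restored.
```

Whether this pattern pays off is exactly the question: it depends on whether shrinking and re-growing the pool merely idles threads or actually destroys and recreates them.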
Regarding `parallel_for` in PyTorch: due to issue 24080, PR 26886 automatically scaled down the number of threads in the OpenMP thread pool, effectively creating a variable-sized thread pool.
Because of issue 32008, the thread pool was made fixed-size again in PR 32875. The stated rationale for reverting to the `omp parallel if` style was to avoid synchronization when `num_threads` is 1. However, the `if` clause and `num_threads(num_threads)` can be used together on the same `omp parallel` directive, and `num_threads(1)` doesn't lead to OpenMP parallelism anyway. Was the original rationale (in the very first commit of `ParallelOpenMP.h`) for using `omp parallel if` without a `num_threads` clause to keep the thread pool fixed-size?
Intuitively, it seems to me that a fixed-size thread pool would outperform a variable-sized one when `taskset` and/or `numactl` are used to run PyTorch jobs, but the benchmarks in `pytorch/benchmarks` are not compatible with the latest Python versions, so I wasn't able to verify this. I'm trying to learn about the systems-design decisions of various systems, and it'd be great to understand why PyTorch initially had a fixed-size OpenMP thread pool but, for a period of time, was okay with a variable-sized one as well.