By default, PyTorch has a fixed-size OpenMP thread pool (equal to the number of physical cores on a processor). Most PyTorch users who run multiple PyTorch jobs concurrently on CPU probably use `numactl` to set the particular CPU cores their PyTorch jobs can use, in order to saturate all cores of their system. Doing so also avoids slowdown due to OpenMP synchronization for operations that should not be parallelized.
Let’s consider a hypothetical scenario. Say a user assigned 8 cores to one of their PyTorch jobs.
If their PyTorch job’s workload also entails operations that are better suited to running on fewer cores than were assigned, should they programmatically scale the number of threads before and after such operations with `torch.set_num_threads()`? That is, would decreasing the number of OpenMP threads before such operations and increasing it again afterwards be efficient enough to amortize the thread destruction and creation cost? If yes, then a variable-sized thread pool would handle creating and destroying the threads in the OpenMP thread pool automatically.
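
To make the question concrete, here is a minimal sketch of what such programmatic scaling could look like. The `limit_threads` helper is a hypothetical name of mine, not a PyTorch API; it only wraps `torch.get_num_threads()` and `torch.set_num_threads()`. The choice of `cumsum` as the "small" operation is purely illustrative, and whether the toggling pays off is exactly what I am unsure about.

```python
from contextlib import contextmanager

import torch


@contextmanager
def limit_threads(n):
    """Hypothetical helper: temporarily set the intra-op thread count to n."""
    previous = torch.get_num_threads()
    torch.set_num_threads(n)
    try:
        yield
    finally:
        # Restore the thread count that was active before entering the block.
        torch.set_num_threads(previous)


x = torch.randn(64, 64)

# Operation assumed to benefit from all threads assigned to the job.
y = x @ x

# Operation assumed (for illustration) to be better off on a single thread.
with limit_threads(1):
    z = torch.cumsum(x, dim=0)
```

The `try`/`finally` ensures the previous thread count is restored even if the wrapped operation raises.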
Because of issue 32008, the thread pool was made fixed-size again with PR 32875. The stated rationale for reverting to the `omp parallel if` style was to avoid synchronization when `num_threads` is 1. However, both `omp parallel if` and `num_threads(num_threads)` can be used together, and `num_threads(1)` doesn’t lead to OpenMP being used anyway. Also, was the original rationale (in the very first commit that used `omp parallel if` without `num_threads`) to use a fixed-size thread pool?
Intuitively, it does seem to me that a fixed-size thread pool would perform better than a variable-sized thread pool if `numactl` is used to run PyTorch jobs, but the benchmarks in `pytorch/benchmarks` are not compatible with the latest Python versions. I’m trying to learn about the design decisions behind various systems, and it’d be great to learn why PyTorch initially had a fixed-size OpenMP thread pool but, for a period of time, was okay with a variable-sized thread pool as well.
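
In the absence of a working benchmark suite, a rough way to probe this would be a micro-benchmark along the following lines. This is only a sketch under my own assumptions: the tensor sizes, the iteration count, and the choice of operations are arbitrary, the helper names are hypothetical, and the numbers will depend heavily on the machine, the OpenMP runtime, and how the process is pinned (for example with `numactl`).

```python
import time

import torch

big = torch.randn(512, 512)    # assumed to benefit from many threads
small = torch.randn(32, 32)    # assumed to be too small to parallelize well
full = torch.get_num_threads()


def fixed_pool():
    # Both operations run with the same thread count.
    big @ big
    small.cumsum(0)


def toggled_pool():
    # Shrink to one thread for the small operation, then grow back.
    big @ big
    torch.set_num_threads(1)
    small.cumsum(0)
    torch.set_num_threads(full)


def time_per_iter(fn, iters=50):
    fn()  # warm-up so that thread creation is not counted in the first call
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters


t_fixed = time_per_iter(fixed_pool)
t_toggled = time_per_iter(toggled_pool)
print(f"fixed:   {t_fixed * 1e6:.1f} us/iter")
print(f"toggled: {t_toggled * 1e6:.1f} us/iter")
```

If the toggled variant is consistently slower, that would suggest the resizing cost is not amortized for this particular mix of operations, though a single micro-benchmark like this would not settle the general question.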