By default, PyTorch has a fixed-size OpenMP thread pool (with as many threads as there are physical cores on the processor). Most PyTorch users who run multiple PyTorch jobs concurrently on CPU probably use `taskset` and/or `numactl` to pin each job to particular CPU cores, in order to saturate all cores of their system. Doing so also avoids slowdowns from OpenMP synchronization for operations that should not be parallelized.
Consider a hypothetical scenario: say a user has assigned 8 cores to one of their PyTorch jobs.
If that job's workload also includes operations that run better on fewer cores than the 8 assigned, should the user programmatically scale the number of threads before and after such operations with `torch.set_num_threads()`? That is, would decreasing the number of OpenMP threads before such an operation and increasing it again afterwards be efficient enough to amortize the thread destruction & creation cost? If so, a variable-sized thread pool would automatically handle creating & destroying threads in the OpenMP thread pool.
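For concreteness, the pattern I have in mind looks roughly like this (a minimal sketch using the real `torch.get_num_threads()`/`torch.set_num_threads()` APIs; `limited_threads` is just a helper name I made up, not a PyTorch API):

```python
from contextlib import contextmanager

import torch


@contextmanager
def limited_threads(n: int):
    """Hypothetical helper: run the enclosed ops with `n` intra-op threads,
    then restore the previous thread count."""
    prev = torch.get_num_threads()
    torch.set_num_threads(n)
    try:
        yield
    finally:
        torch.set_num_threads(prev)


# Example: an op that may not benefit from all 8 assigned cores.
x = torch.randn(64, 64)
with limited_threads(1):
    y = x @ x  # runs with a single intra-op thread
# Outside the block, the previous thread count is restored.
```

Whether this pattern pays off is exactly the question: it depends on whether shrinking and re-growing the pool merely idles threads or actually destroys and recreates them.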
Regarding `parallel_for` in PyTorch: due to issue 24080, PR 26886 automatically scaled down the number of threads in the OpenMP thread pool, effectively creating a variable-sized thread pool.
Because of issue 32008, the thread pool was made fixed-size again in PR 32875. The stated rationale for reverting to the `omp parallel if` style was to avoid synchronization when `num_threads` is 1. However, the `if` clause and `num_threads(num_threads)` can be used together on the same `omp parallel` directive, and `num_threads(1)` doesn't lead to OpenMP parallelism anyway. Was the original rationale (in the very first commit of `ParallelOpenMP.h`) for using `omp parallel if` without a `num_threads` clause to keep the thread pool fixed-size?
Intuitively, it seems to me that a fixed-size thread pool would outperform a variable-sized one when `taskset` and/or `numactl` are used to run PyTorch jobs, but the benchmarks in `pytorch/benchmarks` are not compatible with the latest Python versions, so I wasn't able to verify this. I'm trying to learn about the systems-design decisions of various systems, and it'd be great to understand why PyTorch initially had a fixed-size OpenMP thread pool but, for a period of time, was okay with a variable-sized one as well.