Performance issue fitting multiple models in parallel on CPU

I’m using a small TSMixerx model to fit and predict on many different time series datasets. Each run takes around 15 seconds on a CPU. However, since the total volume of datasets is huge, I need to do this at scale: I use Dask distributed to spawn many processes that work through the datasets in parallel. Each process instantiates its own model instance and consumes one dataset; as far as I can tell, the processes share no parameters or data and never need to synchronize.
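
Roughly, the setup looks like the sketch below. This is a simplified sketch, not my exact code: it assumes neuralforecast’s TSMixerx, and the hyperparameters, the fit_one_dataset helper and the datasets list are placeholders.

from dask.distributed import Client
from neuralforecast import NeuralForecast
from neuralforecast.models import TSMixerx

def fit_one_dataset(df):
    # Each worker process builds its own small model and fits one dataset;
    # nothing is shared between tasks. df is a pandas DataFrame with the
    # usual unique_id / ds / y columns.
    model = TSMixerx(h=24, input_size=48, n_series=1, max_steps=500)
    nf = NeuralForecast(models=[model], freq="H")
    nf.fit(df=df)
    return nf.predict()

datasets = ...  # placeholder: list of per-series pandas DataFrames

# One single-threaded worker process per task slot.
client = Client(n_workers=64, threads_per_worker=1, processes=True)
futures = [client.submit(fit_one_dataset, df) for df in datasets]
results = client.gather(futures)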

However, overall throughput does not scale as well as I expected. The profiler shows that per-process model processing time also increases as the number of parallel processes grows:

# of parallel processes    Self CPU time total (per process)
                      4    16.041s
                     32    27.994s
                     64    110.551s

Sample output from the profiler:


                                               Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  Total MFLOPs

                          Optimizer.step#AdamW.step        18.65%        2.992s        98.68%       15.830s      28.782ms           550            --
                                           aten::mm        10.73%        1.721s        10.74%        1.723s      94.615us         18210     48923.238
                                        aten::copy_         8.90%        1.428s         8.90%        1.428s       9.307us        153454            --
                                   aten::bernoulli_         8.32%        1.335s         9.71%        1.558s     257.523us          6050            --
                                          aten::mul         8.05%        1.291s         8.07%        1.295s      68.906us         18796      1028.314
                                        aten::index         7.10%        1.139s         7.17%        1.151s     517.339us          2224            --
                           aten::threshold_backward         4.84%     777.146ms         4.84%     777.146ms     201.856us          3850            --
                            aten::native_layer_norm         3.20%     513.762ms         4.83%     775.055ms     197.014us          3934            --
                                          aten::add         3.13%     502.659ms         3.14%     504.489ms      68.938us          7318       362.784
                   aten::native_layer_norm_backward         2.96%     474.643ms         4.48%     718.816ms     186.706us          3850            --

Self CPU time total: 16.041s


                                               Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  Total MFLOPs

                          Optimizer.step#AdamW.step        16.08%        4.501s        99.05%       27.727s      50.413ms           550            --
                                   aten::bernoulli_        10.79%        3.020s        12.28%        3.438s     568.335us          6050            --
                                           aten::mm         9.09%        2.545s         9.10%        2.547s     139.870us         18210     41598.269
                                        aten::copy_         9.03%        2.529s         9.03%        2.529s      16.479us        153454            --
                                          aten::mul         8.76%        2.451s         8.77%        2.456s     130.653us         18796      1028.314
                           aten::threshold_backward         4.21%        1.180s         4.21%        1.180s     306.455us          3850            --
                            aten::native_layer_norm         4.05%        1.134s         5.81%        1.626s     413.194us          3934            --
                   aten::native_layer_norm_backward         3.58%        1.003s         5.04%        1.412s     366.652us          3850            --
  autograd::engine::evaluate_function: MulBackward0         3.48%     973.440ms         8.05%        2.254s     292.710us          7700            --
                                          aten::add         3.33%     931.456ms         3.33%     933.552ms     127.569us          7318       345.888

Self CPU time total: 27.994s


                                               Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  Total MFLOPs

                          Optimizer.step#AdamW.step        13.44%       14.861s        99.26%      109.733s     104.508ms          1050            --
                                   aten::bernoulli_        12.24%       13.529s        13.76%       15.213s       1.317ms         11550            --
                                          aten::mul         9.32%       10.307s         9.33%       10.317s     287.564us         35879      1963.109
                                        aten::copy_         9.15%       10.115s         9.15%       10.115s      34.542us        292837            --
                                           aten::mm         8.16%        9.024s         8.17%        9.028s     259.732us         34760     74743.071
  autograd::engine::evaluate_function: MulBackward0         4.83%        5.339s         9.63%       10.648s     724.381us         14700            --
                            aten::native_layer_norm         4.58%        5.064s         6.39%        7.062s     941.159us          7504            --
autograd::engine::evaluate_function: AddmmBackward0         4.46%        4.928s         8.65%        9.559s       1.012ms          9450            --
 autograd::engine::evaluate_function: ReluBackward0         4.07%        4.498s         7.30%        8.069s       1.098ms          7350            --
                   aten::native_layer_norm_backward         4.00%        4.423s         5.40%        5.967s     811.808us          7350            --

Self CPU time total: 110.551s
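
For reference, these tables are collected per process with the PyTorch profiler, roughly like this (the exact arguments and sort key may differ slightly from what I actually ran):

import torch
from torch.profiler import profile, ProfilerActivity

# Wrap one full fit of the small model; with_flops=True produces the MFLOPs column.
with profile(activities=[ProfilerActivity.CPU], with_flops=True) as prof:
    nf.fit(df=df)  # one training run on a single dataset (nf as in the sketch above)

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))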

Observations:

  • The per-process CPU time increases with the number of processes.
  • The main contributors to CPU time are functions like Optimizer.step#AdamW.step, aten::mm, aten::copy_, aten::bernoulli_, and aten::mul.
  • The scaling behavior suggests overhead or contention as more processes are added.

Things I’ve checked / tried:

  • CPU oversubscription: not happening; the cores are not saturated
  • RAM: 50–70% usage at 64-way parallelism
  • I/O wait: none; data is pre-fetched and the monitoring tools show no sign of I/O stalls
  • Set all *_NUM_THREADS environment variables and PyTorch thread counts to 1:
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1

torch.set_num_interop_threads(1)
torch.set_num_threads(1) 

print(torch.__config__.parallel_info())
ATen/Parallel:
        at::get_num_threads() : 1
        at::get_num_interop_threads() : 1
OpenMP 201511 (a.k.a. OpenMP 4.5)
        omp_get_max_threads() : 1
Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
        mkl_get_max_threads() : 1
Intel(R) MKL-DNN v3.4.2 (Git Hash 1137e04ec0b5251ca2b4400a4fd3c667ce843d67)
std::thread::hardware_concurrency() : 192
Environment variables:
        OMP_NUM_THREADS : 1
        MKL_NUM_THREADS : 1
ATen parallel backend: OpenMP
  • Built PyTorch from source and switched BLAS from MKL to OpenBLAS (some search results suggest MKL may not perform well on non-Intel CPUs). However, I don’t know how to remove the MKL dependency completely at this point.

PyTorch built with:

  • GCC 11.4
  • C++ Version: 201703
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX512
  • CUDA Runtime 12.5
  • NVCC architecture flags: -gencode;arch=compute_89,code=sm_89
  • CuDNN 90.6 (built against CUDA 12.6)
  • Magma 2.8.0
  • Build settings: BLAS_INFO=open, BUILD_TYPE=Release, CUDA_VERSION=12.5, CUDNN_VERSION=9.6.0, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS=-march=native -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=open, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.5.0, USE_CUDA=1, USE_CUDNN=1, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKLDNN=0, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=1, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
  • Used the Zentorch plugin to compile the model (roughly as sketched below). This seems a bit better, but the parallelism penalty is still notable, and swap starts to kick in, seemingly because of the additional model compilation.
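
The Zentorch part is roughly the following (a stand-in nn.Sequential replaces the real TSMixerx module in this sketch; as far as I understand, importing zentorch registers the backend for torch.compile):

import torch
import torch.nn as nn
import zentorch  # importing zentorch makes the "zentorch" backend available to torch.compile

# Stand-in module for this sketch; in my runs it is the TSMixerx model instance.
model = nn.Sequential(nn.Linear(48, 64), nn.ReLU(), nn.Linear(64, 24))
compiled = torch.compile(model, backend="zentorch")
out = compiled(torch.randn(32, 48))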

Questions:

  • What could be causing the per-process CPU time to increase with more processes?
  • Did I miss any best practices for this usage scenario, i.e. running many instances of a small model in parallel at scale?

I compared the build options between v2.4.1 (installed via conda) and v2.5.1 (installed via pip); it seems the latter is missing PERF_WITH_AVX512=1.
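
The build strings below are what torch.__config__.show() prints; the runtime CPU dispatch level can also be checked directly (assuming torch.backends.cpu.get_cpu_capability() is available in these versions):

import torch

print(torch.__config__.show())                  # the "PyTorch built with:" blocks below
# Which SIMD dispatch level this build actually selects at runtime, e.g. "AVX512"
print(torch.backends.cpu.get_cpu_capability())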

v2.4.1 installed via conda
PyTorch built with:

  • GCC 9.3
  • C++ Version: 201703
  • Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v3.4.2 (Git Hash 1137e04ec0b5251ca2b4400a4fd3c667ce843d67)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX512
  • CUDA Runtime 12.4
  • NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  • CuDNN 90.1
  • Magma 2.6.1
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.4, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.4.1, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

v2.5.1 installed via pip
PyTorch built with:

  • GCC 9.3
  • C++ Version: 201703
  • Intel(R) oneAPI Math Kernel Library Version 2024.2-Product Build 20240605 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v3.5.3 (Git Hash 66f0cb9eb66affd2da3bf5f8d897376f04aae6af)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX512
  • CUDA Runtime 12.4
  • NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  • CuDNN 90.1
  • Magma 2.6.1
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.4, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.5.1, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,