I’m using a small TSMixerx model to fit and predict on many different time series datasets. Each run takes around 15 seconds on a CPU. However, since the total volume of datasets is huge, I need to do this at scale, using Dask distributed to spawn many processes that digest the workload in parallel. Each process instantiates a new model instance and consumes one of the datasets; as far as I can tell, the processes share no parameters or data and never need to synchronize.
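For context, the orchestration looks roughly like this (a simplified sketch; `build_tsmixerx`, `load_dataset`, and `dataset_paths` are placeholders for my actual code):

```python
from distributed import Client

def fit_one(dataset_path):
    # Each task builds a fresh model and trains on a single dataset;
    # nothing is shared between tasks and no synchronization is needed.
    model = build_tsmixerx()          # placeholder: model construction
    df = load_dataset(dataset_path)   # placeholder: data loading (pre-fetched)
    model.fit(df)
    return model.predict(df)

client = Client(n_workers=64, threads_per_worker=1, processes=True)
futures = client.map(fit_one, dataset_paths)  # one task per dataset
results = client.gather(futures)
```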
But the overall throughput does not scale as well as I expected. The profiler shows that per-run CPU time also increases as the number of parallel processes grows:
| # Processes | Self CPU time total |
|---|---|
| 4 | 16.041s |
| 32 | 27.994s |
| 64 | 110.551s |
Sample output from the profiler:
**4 processes:**

```
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls Total MFLOPs
Optimizer.step#AdamW.step 18.65% 2.992s 98.68% 15.830s 28.782ms 550 --
aten::mm 10.73% 1.721s 10.74% 1.723s 94.615us 18210 48923.238
aten::copy_ 8.90% 1.428s 8.90% 1.428s 9.307us 153454 --
aten::bernoulli_ 8.32% 1.335s 9.71% 1.558s 257.523us 6050 --
aten::mul 8.05% 1.291s 8.07% 1.295s 68.906us 18796 1028.314
aten::index 7.10% 1.139s 7.17% 1.151s 517.339us 2224 --
aten::threshold_backward 4.84% 777.146ms 4.84% 777.146ms 201.856us 3850 --
aten::native_layer_norm 3.20% 513.762ms 4.83% 775.055ms 197.014us 3934 --
aten::add 3.13% 502.659ms 3.14% 504.489ms 68.938us 7318 362.784
aten::native_layer_norm_backward 2.96% 474.643ms 4.48% 718.816ms 186.706us 3850 --
Self CPU time total: 16.041s
```
**32 processes:**

```
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls Total MFLOPs
Optimizer.step#AdamW.step 16.08% 4.501s 99.05% 27.727s 50.413ms 550 --
aten::bernoulli_ 10.79% 3.020s 12.28% 3.438s 568.335us 6050 --
aten::mm 9.09% 2.545s 9.10% 2.547s 139.870us 18210 41598.269
aten::copy_ 9.03% 2.529s 9.03% 2.529s 16.479us 153454 --
aten::mul 8.76% 2.451s 8.77% 2.456s 130.653us 18796 1028.314
aten::threshold_backward 4.21% 1.180s 4.21% 1.180s 306.455us 3850 --
aten::native_layer_norm 4.05% 1.134s 5.81% 1.626s 413.194us 3934 --
aten::native_layer_norm_backward 3.58% 1.003s 5.04% 1.412s 366.652us 3850 --
autograd::engine::evaluate_function: MulBackward0 3.48% 973.440ms 8.05% 2.254s 292.710us 7700 --
aten::add 3.33% 931.456ms 3.33% 933.552ms 127.569us 7318 345.888
Self CPU time total: 27.994s
```
**64 processes:**

```
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls Total MFLOPs
Optimizer.step#AdamW.step 13.44% 14.861s 99.26% 109.733s 104.508ms 1050 --
aten::bernoulli_ 12.24% 13.529s 13.76% 15.213s 1.317ms 11550 --
aten::mul 9.32% 10.307s 9.33% 10.317s 287.564us 35879 1963.109
aten::copy_ 9.15% 10.115s 9.15% 10.115s 34.542us 292837 --
aten::mm 8.16% 9.024s 8.17% 9.028s 259.732us 34760 74743.071
autograd::engine::evaluate_function: MulBackward0 4.83% 5.339s 9.63% 10.648s 724.381us 14700 --
aten::native_layer_norm 4.58% 5.064s 6.39% 7.062s 941.159us 7504 --
autograd::engine::evaluate_function: AddmmBackward0 4.46% 4.928s 8.65% 9.559s 1.012ms 9450 --
autograd::engine::evaluate_function: ReluBackward0 4.07% 4.498s 7.30% 8.069s 1.098ms 7350 --
aten::native_layer_norm_backward 4.00% 4.423s 5.40% 5.967s 811.808us 7350 --
Self CPU time total: 110.551s
```
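For reference, each of the tables above was collected with a wrapper along these lines (a sketch; `model.fit(df)` stands in for one full training run inside a worker):

```python
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU], with_flops=True) as prof:
    model.fit(df)  # one full training run on a single dataset

# Sort by self CPU time; with_flops=True adds the "Total MFLOPs" column.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```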
Observations:
- The per-process CPU time increases with the number of processes.
- The main contributors to CPU time are functions like `Optimizer.step#AdamW.step`, `aten::mm`, `aten::copy_`, `aten::bernoulli_`, and `aten::mul`.
- The scaling behavior suggests overhead or contention as more processes are added.
Things I’ve checked / tried:
- CPU oversubscription: the CPUs are not saturated
- RAM: 50-70% usage at 64-way parallelism
- I/O wait: none; the data is pre-fetched and the monitoring tools show no sign of it
- Set all `*_NUM_THREADS` environment variables and the Torch thread counts to 1 (a sketch of how these settings are applied on each worker follows this list):
```sh
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
```

```python
torch.set_num_interop_threads(1)
torch.set_num_threads(1)
print(torch.__config__.parallel_info())
```
```
ATen/Parallel:
  at::get_num_threads() : 1
  at::get_num_interop_threads() : 1
OpenMP 201511 (a.k.a. OpenMP 4.5)
  omp_get_max_threads() : 1
Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
  mkl_get_max_threads() : 1
Intel(R) MKL-DNN v3.4.2 (Git Hash 1137e04ec0b5251ca2b4400a4fd3c667ce843d67)
std::thread::hardware_concurrency() : 192
Environment variables:
  OMP_NUM_THREADS : 1
  MKL_NUM_THREADS : 1
ATen parallel backend: OpenMP
```
- Built PyTorch from source and switched BLAS from MKL to OpenBLAS (some search results suggest MKL may not perform well on non-Intel CPUs). However, I don't know how to remove MKL completely at this point.
```
PyTorch built with:
- GCC 11.4
- C++ Version: 201703
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 12.5
- NVCC architecture flags: -gencode;arch=compute_89,code=sm_89
- CuDNN 90.6 (built against CUDA 12.6)
- Magma 2.8.0
- Build settings: BLAS_INFO=open, BUILD_TYPE=Release, CUDA_VERSION=12.5, CUDNN_VERSION=9.6.0, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS=-march=native -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=open, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.5.0, USE_CUDA=1, USE_CUDNN=1, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKLDNN=0, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=1, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
```
- Used the ZenTorch plugin to compile the model. This seems a bit better, but the parallelism penalty is still notable, and swap kicks in, apparently because of the extra memory used by model compilation.
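As mentioned above, this is roughly how I apply the thread limits on every Dask worker (a sketch, not my exact code; the exports are also set in the shell before the workers start so the thread pools never see larger values):

```python
import os
import torch
from distributed import Client

def limit_threads():
    # Runs once inside each worker process. The *_NUM_THREADS variables only
    # help if they are set before the numerical libraries create their thread
    # pools, and torch.set_num_interop_threads() must be called before any
    # inter-op parallel work starts.
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS",
                "OPENBLAS_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
        os.environ[var] = "1"
    torch.set_num_interop_threads(1)
    torch.set_num_threads(1)
    return torch.__config__.parallel_info()

client = Client(n_workers=64, threads_per_worker=1)
print(client.run(limit_threads))  # verify the settings on every worker
```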
Questions:
- What could be causing the per-process CPU time to increase with more processes?
- Did I miss any best practices for this usage scenario, i.e. running many independent instances of a small model in parallel at scale?