Inference speed discrepancies in TorchServe

Hello,

Odd question but I’m hitting a wall.

I have two very similar models trained on two slightly different objectives. The architectures are nearly identical; the only difference is one small linear layer. Using torch.jit.trace I freeze both models and deploy them with TorchServe. One model runs 8x faster than the other. In a notebook, loading both JIT models, they run at the same speed and their profiles are very similar. See below.
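For context, the export path for both models is roughly the following (a minimal sketch; the stand-in module and input shape are placeholders, not the real architecture):

import torch
import torch.nn as nn

# Stand-in for the real architectures; the two real models differ only by one small linear layer.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 16)
).eval()
example = torch.randn(1, 3, 224, 224)  # placeholder input shape

with torch.no_grad():
    traced = torch.jit.trace(model, example)
    frozen = torch.jit.freeze(traced)

frozen.save("model.pt")  # this file is what gets packaged into the .mar for TorchServe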

Any ideas?

With respect to the timing on TorchServe:

### Model 2
inference_1    | 2023-08-03T18:29:28,566 [INFO ] W-9054-XXX-1.0.0-stdout MODEL_LOG - Preprocessing time: 0.02347397804260254
inference_1    | 2023-08-03T18:29:28,566 [INFO ] W-9054-XXX-1.0.0-stdout MODEL_LOG - Inference time: 0.4965062141418457
inference_1    | 2023-08-03T18:29:28,566 [INFO ] W-9054-XXX-1.0.0-stdout MODEL_LOG - Inference time: 0.44716620445251465
inference_1    | 2023-08-03T18:29:28,566 [INFO ] W-9054-XXX-1.0.0-stdout MODEL_LOG - Inference time: 0.23859477043151855
inference_1    | 2023-08-03T18:29:28,566 [INFO ] W-9054-XXX-1.0.0-stdout MODEL_LOG - Inference time: 0.23617815971374512
inference_1    | 2023-08-03T18:29:28,566 [INFO ] W-9054-XXX-1.0.0-stdout MODEL_LOG - Inference time: 0.22520685195922852
inference_1    | 2023-08-03T18:29:28,566 [INFO ] W-9054-XXX-1.0.0-stdout MODEL_LOG - Postprocessing time: 0.00019669532775878906


### Model 1 
inference_1    | 2023-08-03T18:30:25,378 [INFO ] W-9063-yyy-stdout MODEL_LOG - Preprocessing time: 0.02209949493408203
inference_1    | 2023-08-03T18:30:25,379 [INFO ] W-9063-yyy-stdout MODEL_LOG - Inference time: 4.135310888290405
inference_1    | 2023-08-03T18:30:25,379 [INFO ] W-9063-yyy-stdout MODEL_LOG - Inference time: 4.075218915939331
inference_1    | 2023-08-03T18:30:25,379 [INFO ] W-9063-yyy-stdout MODEL_LOG - Inference time: 3.8226993083953857
inference_1    | 2023-08-03T18:30:25,379 [INFO ] W-9063-yyy-stdout MODEL_LOG - Inference time: 3.8330533504486084
inference_1    | 2023-08-03T18:30:25,379 [INFO ] W-9063-yyy-stdout MODEL_LOG - Inference time: 3.826336145401001
inference_1    | 2023-08-03T18:30:25,379 [INFO ] W-9063-yyy-stdout MODEL_LOG - Postprocessing time: 0.000415802001953125

Below are profiles from when the models are executed in a notebook.
Model 1:

--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                 model_inference         0.99%     214.000us        99.29%      21.511ms      21.511ms             1  
                         forward         6.25%       1.353ms        98.27%      21.291ms      21.291ms             1  
                    aten::linear         2.08%     450.000us        38.81%       8.409ms     158.660us            53  
              aten::_convolution         2.90%     628.000us        25.02%       5.420ms     258.095us            21  
                    aten::matmul         1.86%     402.000us        19.29%       4.180ms     181.739us            23  
        aten::mkldnn_convolution        16.74%       3.627ms        17.25%       3.738ms     196.737us            19  
                       aten::bmm        14.09%       3.053ms        14.09%       3.053ms     138.773us            22  
                     aten::addmm        12.10%       2.621ms        13.68%       2.963ms      92.594us            32  
                      aten::gelu         6.73%       1.459ms         6.73%       1.459ms      91.188us            16  
                       aten::mul         6.04%       1.308ms         6.04%       1.308ms      81.750us            16  
                aten::batch_norm         0.18%      38.000us         5.81%       1.259ms     157.375us             8  
    aten::_batch_norm_impl_index         0.24%      53.000us         5.64%       1.221ms     152.625us             8  
         aten::native_batch_norm         3.95%     856.000us         5.35%       1.158ms     144.750us             8  
                aten::group_norm         0.36%      77.000us         4.06%     880.000us      55.000us            16  
         aten::native_group_norm         3.41%     739.000us         3.71%     803.000us      50.188us            16  
                     aten::copy_         3.48%     754.000us         3.48%     754.000us      13.709us            55  
                       aten::add         2.59%     562.000us         3.03%     657.000us      26.280us            25  
                aten::contiguous         0.12%      26.000us         2.88%     625.000us      48.077us            13  
                     aten::clone         0.33%      71.000us         2.76%     599.000us      46.077us            13  
                         aten::t         1.19%     258.000us         2.12%     460.000us       8.679us            53  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 21.665ms

STAGE:2023-08-03 17:57:09 11773:11773 ActivityProfilerController.cpp:294] Completed Stage: Warm Up
STAGE:2023-08-03 17:57:09 11773:11773 ActivityProfilerController.cpp:300] Completed Stage: Collection

Model 2:

--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                 model_inference         5.76%       1.453ms        99.87%      25.213ms      25.213ms             1  
                         forward         8.52%       2.152ms        94.09%      23.754ms      23.754ms             1  
                    <forward op>         1.65%     416.000us        41.59%      10.500ms     500.000us            21  
                    aten::linear         1.53%     387.000us        31.52%       7.958ms     153.038us            52  
              aten::_convolution         2.16%     545.000us        28.54%       7.205ms     343.095us            21  
                    aten::matmul         1.56%     394.000us        14.87%       3.755ms     163.261us            23  
        aten::mkldnn_convolution        13.77%       3.477ms        14.18%       3.579ms     188.368us            19  
                     aten::addmm        10.43%       2.633ms        12.07%       3.046ms      98.258us            31  
                       aten::bmm        10.70%       2.702ms        10.70%       2.702ms     122.818us            22  
               aten::thnn_conv2d         0.04%       9.000us         8.58%       2.167ms       1.083ms             2  
      aten::_slow_conv2d_forward         8.40%       2.120ms         8.55%       2.158ms       1.079ms             2  
                      aten::gelu         5.75%       1.452ms         5.75%       1.452ms      90.750us            16  
    aten::_batch_norm_impl_index         0.25%      63.000us         5.01%       1.265ms     140.556us             9  
         aten::native_batch_norm         3.52%     888.000us         4.72%       1.191ms     132.333us             9  
                      aten::add_         4.22%       1.065ms         4.22%       1.065ms      30.429us            35  
                aten::group_norm         0.27%      69.000us         3.95%     997.000us      62.312us            16  
                aten::batch_norm         0.13%      33.000us         3.81%     961.000us     192.200us             5  
         aten::native_group_norm         3.45%     871.000us         3.68%     928.000us      58.000us            16  
                     aten::copy_         3.42%     864.000us         3.42%     864.000us      16.000us            54  
                aten::contiguous         0.33%      83.000us         2.78%     702.000us      54.000us            13  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 25.245ms

STAGE:2023-08-03 17:30:03 11773:11773 ActivityProfilerController.cpp:294] Completed Stage: Warm Up
STAGE:2023-08-03 17:30:03 11773:11773 ActivityProfilerController.cpp:300] Completed Stage: Collection

Your CPU performance will be heavily influenced by launcher core pinning (see “Grokking PyTorch Intel CPU performance from first principles” in the PyTorch Tutorials documentation), so I highly recommend you enable it.

What does your config.properties look like? Make sure to add these two lines:

cpu_launcher_enable=true
cpu_launcher_args=--use_logical_core

Also, are you deploying both models together in the same TorchServe instance?

Could you first collect perf data as a baseline by running the two models without TorchServe, and then compare it with the data above?
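Something along these lines would give a like-for-like baseline (a rough sketch; the paths and input shape are placeholders):

import time
import torch

# Placeholder paths and input; substitute the real frozen models and a representative batch.
m1 = torch.jit.load("model1.pt").eval()
m2 = torch.jit.load("model2.pt").eval()
x = torch.randn(1, 3, 224, 224)

def bench(m, x, warmup=5, iters=20):
    with torch.no_grad():
        for _ in range(warmup):
            m(x)
        start = time.perf_counter()
        for _ in range(iters):
            m(x)
        return (time.perf_counter() - start) / iters

print("model 1 avg latency (s):", bench(m1, x))
print("model 2 avg latency (s):", bench(m2, x))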

That did it. Thanks for the feedback.

The core pinning? Sweet!

For sure.

I’m trying to optimize the performance a bit more, so I switched to jemalloc. It runs, but I get the following warnings:

<jemalloc>: Invalid conf pair: oversize_threshold:1
<jemalloc>: Invalid conf pair: background_thread:true
<jemalloc>: Invalid conf pair: metadata_thp:auto

Do you think there is a versioning issue between jemalloc and the launcher?

Thanks. :pray:

More logs below.

2023-08-16T00:01:48,190 [WARN ] W-9035-model_1.0.0-stderr MODEL_LOG - 2023-08-16 00:01:48,190 - __main__ - INFO - Use 'jemalloc' memory allocator.
2023-08-16T00:01:48,190 [WARN ] W-9035-model_1.0.0-stderr MODEL_LOG - 2023-08-16 00:01:48,190 - __main__ - INFO - Use 'auto' => 'intel' OpenMP runtime.
2023-08-16T00:01:48,202 [WARN ] W-9035-model_1.0.0-stderr MODEL_LOG - 2023-08-16 00:01:48,202 - __main__ - INFO - Use 'auto' => 'taskset' multi-task manager.
2023-08-16T00:01:48,202 [WARN ] W-9035-model_1.0.0-stderr MODEL_LOG - 2023-08-16 00:01:48,202 - __main__ - INFO - env: Untouched preset environment variables are not displayed.
2023-08-16T00:01:48,202 [WARN ] W-9035-model_1.0.0-stderr MODEL_LOG - 2023-08-16 00:01:48,202 - __main__ - INFO - env: LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so:/usr/lib/x86_64-linux-gnu/libiomp5.so:
2023-08-16T00:01:48,202 [WARN ] W-9035-model_1.0.0-stderr MODEL_LOG - 2023-08-16 00:01:48,202 - __main__ - INFO - env: MALLOC_CONF=oversize_threshold:1,background_thread:true,metadata_thp:auto
2023-08-16T00:01:48,203 [WARN ] W-9035-model_1.0.0-stderr MODEL_LOG - 2023-08-16 00:01:48,202 - __main__ - INFO - env: KMP_BLOCKTIME=1
2023-08-16T00:01:48,203 [WARN ] W-9035-model_1.0.0-stderr MODEL_LOG - 2023-08-16 00:01:48,203 - __main__ - INFO - env: OMP_NUM_THREADS=8
2023-08-16T00:01:48,203 [WARN ] W-9035-model_1.0.0-stderr MODEL_LOG - 2023-08-16 00:01:48,203 - __main__ - INFO - cmd: taskset -c 0-7 /home/venv/bin/python3 -u /home/venv/lib/python3.8/site-packages/ts/model_service_worker.py --sock-type unix --sock-name /tmp/.ts.sock.9035
2023-08-16T00:01:48,208 [WARN ] W-9035-model_1.0.0-stderr MODEL_LOG - <jemalloc>: Invalid conf pair: oversize_threshold:1
2023-08-16T00:01:48,208 [WARN ] W-9035-model_1.0.0-stderr MODEL_LOG - <jemalloc>: Invalid conf pair: background_thread:true
2023-08-16T00:01:48,208 [WARN ] W-9035-model_1.0.0-stderr MODEL_LOG - <jemalloc>: Invalid conf pair: metadata_thp:auto
2023-08-16T00:01:48,209 [WARN ] W-9035-model_1.0.0-stderr MODEL_LOG - <jemalloc>: Invalid conf pair: oversize_threshold:1
2023-08-16T00:01:48,209 [WARN ] W-9035-model_1.0.0-stderr MODEL_LOG - <jemalloc>: Invalid conf pair: background_thread:true
2023-08-16T00:01:48,209 [WARN ] W-9035-model_1.0.0-stderr MODEL_LOG - <jemalloc>: Invalid conf pair: metadata_thp:auto
2023-08-16T00:01:48,210 [WARN ] W-9035-model_1.0.0-stderr MODEL_LOG - <jemalloc>: Invalid conf pair: oversize_threshold:1
2023-08-16T00:01:48,210 [WARN ] W-9035-model_1.0.0-stderr MODEL_LOG - <jemalloc>: Invalid conf pair: background_thread:true
2023-08-16T00:01:48,210 [WARN ] W-9035-model_1.0.0-stderr MODEL_LOG - <jemalloc>: Invalid conf pair: metadata_thp:auto

Sorry, just saw this. Do you mind posting this on our GitHub issues and tagging min-jean-cho?

EDIT: never mind, I just made an issue myself: jemalloc error · Issue #2582 · pytorch/serve · GitHub

Hi, sorry to revive an old thread, but I’m facing what seems like a very similar issue, and I can’t find anything else related online. However, I already have those options in my config.properties; are you sure that nothing else changed in your configuration?

I have two IResNet100-based models, which are identical except for the final layer: one produces a 512-d embedding, the other a 256-d embedding. I enabled the profiler in TorchServe, and you can see that the same 58 calls to aten::mkldnn_convolution are made with the same layer sizes in both backbones, yet evaluation is many times faster for the 256-d embedding model.
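For reference, this kind of per-shape breakdown can also be reproduced outside TorchServe with something like the following (a rough sketch; the path is a placeholder and the input matches the [2, 3, 112, 112] forward call in the tables below):

import torch
from torch.profiler import profile, ProfilerActivity

model = torch.jit.load("model_256d.pt").eval()  # placeholder path
x = torch.randn(2, 3, 112, 112)                 # matches the [2, 3, 112, 112] forward below

with torch.no_grad(), profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    model(x)

# Group by input shape so repeated convolutions with identical shapes are aggregated, as in the tables.
print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_time_total", row_limit=10))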

256-d embedding model:

2024-02-26T18:08:59,922 [INFO ] W-9000-256d-model-stdout MODEL_LOG - ---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
2024-02-26T18:08:59,923 [INFO ] W-9000-256d-model-stdout MODEL_LOG -                              Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                                                      Input Shapes  
2024-02-26T18:08:59,923 [INFO ] W-9000-256d-model-stdout MODEL_LOG - ---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
2024-02-26T18:08:59,923 [INFO ] W-9000-256d-model-stdout MODEL_LOG -                        preprocess        20.18%     466.460ms        58.49%        1.352s        1.352s             1                                                                                []  
2024-02-26T18:08:59,924 [INFO ] W-9000-256d-model-stdout MODEL_LOG -                         inference        17.60%     406.942ms        41.50%     959.438ms     959.438ms             1                                                                                []  
2024-02-26T18:08:59,924 [INFO ] W-9000-256d-model-stdout MODEL_LOG -                           forward         1.92%      44.313ms        36.01%     832.570ms     832.570ms             1                                                        [[], [2, 3, 600, 600], []]  
2024-02-26T18:08:59,924 [INFO ] W-9000-256d-model-stdout MODEL_LOG -                           forward         0.72%      16.601ms        23.90%     552.496ms     552.496ms             1                                                            [[], [2, 3, 112, 112]]  
2024-02-26T18:08:59,924 [INFO ] W-9000-256d-model-stdout MODEL_LOG -                      aten::linear         0.00%      80.000us        20.54%     474.834ms     474.834ms             1                                                   [[123, 576], [128, 576], [128]]  
2024-02-26T18:08:59,925 [INFO ] W-9000-256d-model-stdout MODEL_LOG -                       aten::addmm        20.53%     474.656ms        20.53%     474.699ms     474.699ms             1                                           [[128], [123, 576], [576, 128], [], []]  
2024-02-26T18:08:59,925 [INFO ] W-9000-256d-model-stdout MODEL_LOG -                aten::_convolution         0.08%       1.926ms        10.28%     237.587ms       4.096ms            58  [[2, 256, 14, 14], [256, 256, 3, 3], [], [], [], [], [], [], [], [], [], [], []]  
2024-02-26T18:08:59,925 [INFO ] W-9000-256d-model-stdout MODEL_LOG -          aten::mkldnn_convolution        10.15%     234.589ms        10.19%     235.661ms       4.063ms            58                          [[2, 256, 14, 14], [256, 256, 3, 3], [], [], [], [], []]  
2024-02-26T18:08:59,925 [INFO ] W-9000-256d-model-stdout MODEL_LOG -                aten::_convolution         0.15%       3.433ms         4.31%      99.637ms       4.152ms            24  [[2, 128, 28, 28], [128, 128, 3, 3], [], [], [], [], [], [], [], [], [], [], []]  
2024-02-26T18:08:59,926 [INFO ] W-9000-256d-model-stdout MODEL_LOG -          aten::mkldnn_convolution         4.14%      95.738ms         4.16%      96.204ms       4.008ms            24                          [[2, 128, 28, 28], [128, 128, 3, 3], [], [], [], [], []]  
2024-02-26T18:08:59,926 [INFO ] W-9000-256d-model-stdout MODEL_LOG - ---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  

512-d embedding model:

2024-02-26T18:09:22,273 [INFO ] W-9001-512d-model-stdout MODEL_LOG - ---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
2024-02-26T18:09:22,274 [INFO ] W-9001-512d-model-stdout MODEL_LOG -                              Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                                                      Input Shapes  
2024-02-26T18:09:22,274 [INFO ] W-9001-512d-model-stdout MODEL_LOG - ---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
2024-02-26T18:09:22,275 [INFO ] W-9001-512d-model-stdout MODEL_LOG -                         inference         3.04%     439.422ms        91.95%       13.272s       13.272s             1                                                                                []  
2024-02-26T18:09:22,275 [INFO ] W-9001-512d-model-stdout MODEL_LOG -                           forward         0.15%      21.308ms        88.90%       12.833s       12.833s             1                                                            [[], [2, 3, 112, 112]]  
2024-02-26T18:09:22,276 [INFO ] W-9001-512d-model-stdout MODEL_LOG -                aten::_convolution         0.02%       2.636ms        65.60%        9.469s     163.266ms            58  [[2, 256, 14, 14], [256, 256, 3, 3], [], [], [], [], [], [], [], [], [], [], []]  
2024-02-26T18:09:22,276 [INFO ] W-9001-512d-model-stdout MODEL_LOG -          aten::mkldnn_convolution        65.57%        9.465s        65.58%        9.467s     163.221ms            58                          [[2, 256, 14, 14], [256, 256, 3, 3], [], [], [], [], []]  
2024-02-26T18:09:22,276 [INFO ] W-9001-512d-model-stdout MODEL_LOG -                        preprocess         2.65%     381.819ms         8.05%        1.162s        1.162s             1                                                                                []  
2024-02-26T18:09:22,277 [INFO ] W-9001-512d-model-stdout MODEL_LOG -                aten::_convolution         0.02%       2.317ms         7.78%        1.123s     561.535ms             2   [[2, 64, 112, 112], [64, 64, 3, 3], [], [], [], [], [], [], [], [], [], [], []]  
2024-02-26T18:09:22,278 [INFO ] W-9001-512d-model-stdout MODEL_LOG -          aten::mkldnn_convolution         7.76%        1.121s         7.76%        1.121s     560.377ms             2                           [[2, 64, 112, 112], [64, 64, 3, 3], [], [], [], [], []]  
2024-02-26T18:09:22,278 [INFO ] W-9001-512d-model-stdout MODEL_LOG -                aten::_convolution         0.00%     155.000us         6.05%     873.714ms     218.429ms             4    [[2, 512, 7, 7], [512, 512, 3, 3], [], [], [], [], [], [], [], [], [], [], []]  
2024-02-26T18:09:22,279 [INFO ] W-9001-512d-model-stdout MODEL_LOG -          aten::mkldnn_convolution         6.05%     873.462ms         6.05%     873.559ms     218.390ms             4                            [[2, 512, 7, 7], [512, 512, 3, 3], [], [], [], [], []]  
2024-02-26T18:09:22,279 [INFO ] W-9001-512d-model-stdout MODEL_LOG -                           forward         0.37%      53.022ms         5.17%     746.691ms     746.691ms             1                                                        [[], [2, 3, 600, 600], []]  
2024-02-26T18:09:22,279 [INFO ] W-9001-512d-model-stdout MODEL_LOG - ---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------------------------------------------------------------------------- 

These stats were collected while the models were running through the same instance of TorchServe (so with identical settings) on my dev machine (an Intel MacBook), but I see the same behaviour when the models are deployed on a CPU-only Linux server.

I’ve managed to resolve this issue. For anyone else who stumbles across this thread, it was due to denormal numbers in the weights of the larger model. These were significantly affecting the CPU inference times, but not the GPU inference or training times, which is why I hadn’t noticed them previously.
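For anyone who wants to check their own weights, a rough sketch like this (assuming the checkpoint is a plain float32 state dict; the path is a placeholder) counts denormal values per tensor:

import torch

state = torch.load("model_512d.pth", map_location="cpu")  # placeholder checkpoint path
tiny = torch.finfo(torch.float32).tiny                    # smallest positive normal float32

for name, t in state.items():
    if torch.is_floating_point(t):
        n = ((t != 0) & (t.abs() < tiny)).sum().item()    # nonzero values below the normal range
        if n:
            print(f"{name}: {n} denormal values")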

The solution was simply to add torch.set_flush_denormal(True) to the __init__ method of the handler.
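In handler terms that looks roughly like this (a minimal sketch on top of TorchServe's BaseHandler; the class name is a placeholder):

import torch
from ts.torch_handler.base_handler import BaseHandler

class EmbeddingHandler(BaseHandler):  # placeholder handler name
    def __init__(self):
        super().__init__()
        # Flush denormal values to zero in CPU computations; this is what resolved the slowdown here.
        torch.set_flush_denormal(True)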